MatchSim: a novel similarity measure based on maximum neighborhood matching

  • Regular Paper
Knowledge and Information Systems

Abstract

Measuring object similarity in a graph is a fundamental data mining problem in various application domains, including Web linkage mining, social network analysis, information retrieval, and recommender systems. In this paper, we focus on the neighbor-based approach, which rests on the intuition that “similar objects have similar neighbors,” and propose a novel similarity measure called MatchSim. Our method recursively defines the similarity between two objects as the average similarity of their maximum-matched neighbor pairs. We show that MatchSim conforms to the basic intuition of similarity and therefore overcomes the counterintuitive contradiction in SimRank. Moreover, MatchSim can be viewed as an extension of the traditional neighbor-counting scheme that takes the similarities between neighbors into account, which makes it more flexible. We present the MatchSim score computation process and prove its convergence. We also analyze its time and space complexity and suggest two acceleration techniques: (1) a simple pruning strategy and (2) an approximation algorithm for the maximum matching computation. Experimental results on real-world datasets show that, although our method is computationally less efficient, it outperforms classic methods in terms of accuracy.
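
The recursive definition in the abstract can be illustrated with a minimal sketch. Several details here are assumptions rather than statements from the abstract: the score is normalized by the larger of the two neighbor counts, self-similarity is fixed at 1, pairs in which either object has no neighbors score 0, and the iteration runs for a fixed number of rounds. The function names `max_matching_weight` and `matchsim` are hypothetical, and a brute-force search over assignments stands in for a proper assignment algorithm such as the Hungarian method, so the sketch is only practical for very small neighbor sets.

```python
from itertools import permutations


def max_matching_weight(left, right, sim):
    """Weight of a maximum-weight matching between two small node lists.

    Brute force: tries every injective assignment of the shorter list
    into the longer one (a stand-in for a real assignment algorithm).
    """
    if len(left) > len(right):
        left, right = right, left
    best = 0.0
    for perm in permutations(right, len(left)):
        weight = sum(sim[(u, v)] for u, v in zip(left, perm))
        best = max(best, weight)
    return best


def matchsim(in_neighbors, iterations=10):
    """Iteratively compute neighbor-matching similarity scores.

    in_neighbors maps each node to the list of its neighbors.
    Assumed update rule: matching weight / max(neighbor counts).
    """
    nodes = sorted(in_neighbors)
    # Start from the identity: each node is similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                na, nb = in_neighbors[a], in_neighbors[b]
                if a == b:
                    new_sim[(a, b)] = 1.0  # assumption: fixed self-similarity
                elif not na or not nb:
                    new_sim[(a, b)] = 0.0  # assumption: no neighbors, no evidence
                else:
                    weight = max_matching_weight(na, nb, sim)
                    new_sim[(a, b)] = weight / max(len(na), len(nb))
        sim = new_sim
    return sim


# Toy graph: c and d have identical neighbors; e shares one of them.
graph = {"a": [], "b": [], "c": ["a", "b"], "d": ["a", "b"], "e": ["a"]}
scores = matchsim(graph)
print(scores[("c", "d")])  # 1.0
print(scores[("c", "e")])  # 0.5
```

In this toy graph, `c` and `d` have exactly the same neighbors and receive a score of 1.0, while `e` matches only one of the two neighbors of `c` and receives 0.5. Under the stated assumptions, this is the kind of behavior the abstract describes as conforming to the basic intuition of similarity.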

Author information

Corresponding author

Correspondence to Zhenjiang Lin.

About this article

Cite this article

Lin, Z., Lyu, M.R. & King, I. MatchSim: a novel similarity measure based on maximum neighborhood matching. Knowl Inf Syst 32, 141–166 (2012). https://doi.org/10.1007/s10115-011-0427-z

  • DOI: https://doi.org/10.1007/s10115-011-0427-z
