
Meta similarity

Applied Intelligence

Abstract

Various string similarity metrics have been used to decide whether two given strings match. These metrics fall into three classes: (a) edit-distance-based similarities, (b) token-based similarities, and (c) hybrid similarities. Because each type has different strengths and weaknesses in measuring the similarity between two strings, the metrics in each class tend to work well only for particular data sets. To address this problem, we propose a novel Meta Similarity that both (i) outperforms the existing similarity metrics and (ii) is the least affected by differences among data sets. Our claim is empirically validated through extensive experiments: our proposal improves average recall by up to 20% compared to the best of the existing similarity metrics, and it is the most stable method, with average recall ranging from 0.95 to 1.0 across all the data sets.
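To make the first two classes concrete, the following minimal sketch contrasts one edit-distance-based measure (normalized Levenshtein distance) with one token-based measure (Jaccard overlap of word sets). It is an illustration only and is not the Meta Similarity proposed in the article; the function names, the normalization by the longer string's length, and the whitespace tokenization are all assumptions chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Token-based: overlap of lowercased, whitespace-separated word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    pairs = [
        ("Jeffrey Ullman", "Jeffery Ullman"),  # typo inside one token
        ("Byung-Won On", "On Byung-Won"),      # same tokens, reordered
    ]
    for x, y in pairs:
        print(x, "|", y, edit_similarity(x, y), jaccard_similarity(x, y))
```

In this sketch, a typo inside one token barely lowers the edit-based score but sharply lowers the token-based one, while reordered tokens have the opposite effect. That is one way to see why a metric from a single class may suit only particular data sets.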



Author information


Corresponding author

Correspondence to Byung-Won On.


About this article

Cite this article

On, BW., Lee, I. Meta similarity. Appl Intell 35, 359–374 (2011). https://doi.org/10.1007/s10489-010-0226-3


  • DOI: https://doi.org/10.1007/s10489-010-0226-3
