Abstract
To decide whether two given strings match, various string similarity metrics have been employed. These metrics fall into three classes: (a) edit-distance-based similarities, (b) token-based similarities, and (c) hybrid similarities. Because each type of similarity has different strengths and weaknesses in measuring how close two strings are, the metrics in each class tend to work well only on particular data sets. To address this problem, we propose a novel Meta Similarity that both (i) outperforms the existing similarity metrics and (ii) is the least affected by variation across data sets. Our claim is empirically validated through extensive experiments: our proposal improves average recall by up to 20% over the best of the existing similarity metrics, and it is the most stable method, with average recall ranging from 0.95 to 1.0 across all the data sets.
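As a rough illustration of the three classes named above, the following sketch implements one simple representative of each. It is not the Meta Similarity proposed in the article. The function names, the normalization by the longer string's length, the whitespace tokenization, and the choice of a Monge-Elkan-style average for the hybrid case are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """(a) Edit-distance-based: distance normalized into [0, 1] (assumed scheme)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """(b) Token-based: Jaccard overlap of lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def hybrid_similarity(a: str, b: str) -> float:
    """(c) Hybrid: for each token of a, take its best edit similarity against
    the tokens of b, then average (a Monge-Elkan-style scheme, assumed here)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 1.0 if ta == tb else 0.0
    return sum(max(edit_similarity(x, y) for y in tb) for x in ta) / len(ta)


if __name__ == "__main__":
    pairs = [("John Smith", "Smith John"), ("jon smith", "john smith")]
    for s, t in pairs:
        print(s, "|", t,
              round(edit_similarity(s, t), 3),
              round(jaccard_similarity(s, t), 3),
              round(hybrid_similarity(s, t), 3))
```

The two sample pairs hint at why no single class is robust: reordered tokens hurt the edit-distance score but not the token-based one, while a one-character typo inside a token hurts the token-based score far more than the other two.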
Cite this article
On, BW., Lee, I. Meta similarity. Appl Intell 35, 359–374 (2011). https://doi.org/10.1007/s10489-010-0226-3
DOI: https://doi.org/10.1007/s10489-010-0226-3