Meta similarity

On, Byung-Won; Lee, Ingyu

doi:10.1007/s10489-010-0226-3

Published: 27 March 2010

Volume 35, pages 359–374, (2011)

Applied Intelligence

Abstract

To determine whether two given strings match, various string similarity metrics have been employed. These metrics fall into three classes: (a) edit-distance-based similarities, (b) token-based similarities, and (c) hybrid similarities. Because each type of string similarity has its own strengths and weaknesses, the metrics in each class tend to work well only for particular data sets. To address this problem, we propose a novel Meta Similarity that both (i) outperforms the existing similarity metrics and (ii) is the least affected by the choice of data set. Our claim is empirically validated through extensive experimental tests: our proposal improves average recall by up to 20% over the best of the existing similarity metrics, and it is the most stable method, with average recall ranging from 0.95 to 1.0 across all the data sets.
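
The three classes named in the abstract can be illustrated with a minimal Python sketch. It is not the Meta Similarity proposed in the article, whose definition does not appear on this page. The function names are hypothetical, and the specific choices (normalized Levenshtein distance for the edit-distance class, Jaccard over whitespace tokens for the token class, and a Monge-Elkan-style average for the hybrid class) are assumptions picked as common representatives of each class.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """(a) Edit-distance-based: 1 minus the length-normalized distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """(b) Token-based: overlap of lowercase whitespace-token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def hybrid_similarity(a: str, b: str) -> float:
    """(c) Hybrid, Monge-Elkan style: each token of `a` is scored by its
    best edit similarity against the tokens of `b`, then averaged."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 1.0 if ta == tb else 0.0
    return sum(max(edit_similarity(x, y) for y in tb) for x in ta) / len(ta)


if __name__ == "__main__":
    pairs = [("Ingyu Lee", "Lee Ingyu"), ("Ingyu Lee", "Ingyu Le")]
    for s, t in pairs:
        print(s, "|", t,
              round(edit_similarity(s, t), 3),
              round(jaccard_similarity(s, t), 3),
              round(hybrid_similarity(s, t), 3))
```

The two sample pairs show why no single class suits every data set: swapping token order penalizes the character-level score but not the token-level one, while a one-character typo does the opposite.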

Author information

Authors and Affiliations

Singapore Management University, Singapore, Singapore
Byung-Won On
Troy University, Troy, USA
Ingyu Lee

Corresponding author

Correspondence to Byung-Won On.

About this article

Cite this article

On, BW., Lee, I. Meta similarity. Appl Intell 35, 359–374 (2011). https://doi.org/10.1007/s10489-010-0226-3

Published: 27 March 2010
Issue Date: December 2011
DOI: https://doi.org/10.1007/s10489-010-0226-3
