Skip to main content
Log in

Approximate data instance matching: a survey

  • Survey Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Approximate data matching is a central problem in several data management processes, such as data integration, data cleaning, approximate queries, similarity search and so on. An approximate matching process aims at defining whether two data represent the same real-world object. For atomic values (strings, dates, etc), similarity functions have been defined for several value domains (person names, addresses, and so on). For matching aggregated values, such as relational tuples and XML trees, approaches alternate from the definition of simple functions that combine values of similarity of record attributes to sophisticated techniques based on machine learning, for example. For complex data comparison, including structured and semistructured documents, existing approaches use both structure and data for the comparison, by either considering or not considering data semantics. This survey presents terminology and concepts that base approximated data matching, as well as discusses related work on the use of similarity functions in such a subject.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Agrawal R, Faloutsos C, Swami AN (1993) Efficient similarity search in sequence databases. In: Proceedings of the 4th international conference on foundations of data organization and algorithms (FODO). Springer, London, pp 69–84

  2. Al-Khalifa S, Yu C, Jagadish HV (2003) Querying structured text in an xml database. In: Proceedings of the 29th ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 4–15

  3. Arasu A, Chaudhuri S, Kaushik R (2008) Transformation-based framework for record matching. In: Proceedings of the 2008 IEEE 24th international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 40–49

  4. Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: Proceedings of the 32nd international conference on very large data bases (VLDB). VLDB Endowment, pp 918–929

  5. Aygun RS (2008) S2s: structural-to-syntactic matching similar documents. Knowl Inf Syst 16(3): 303–329

    Article  Google Scholar 

  6. Barioni MC, Razente HL, Traina C Jr, Traina AJM (2005) Querying complex objects by similarity in sql. In: Proceedings of the 20th Brazilian symposium on databases (SBBD), pp 130–144

  7. Bhattacharya I, Getoor L (2005) Relational clustering for multi-type entity resolution. In: Proceedings of the 4th international workshop on multi-relational mining (MRDM). ACM, New York, pp 3–12

  8. Bilenko M, Mooney R, Cohen W, Ravikumar P, Fienberg S (2003) Adaptive name matching in information integration. IEEE Intell Syst 18(5): 16–23

    Article  Google Scholar 

  9. Bilenko M, Mooney RJ (2003) Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, New York, pp 39–48

  10. Bollacker K, Lawrence S, Giles CL (1998) CiteSeer: an autonomous web agent for automatic retrieval and identification of interesting publications. In: Proceedings of the 2nd international conference on autonomous agents. ACM Press, New York, pp 116–123

  11. Boser BE, Guyon IM, Vapnik V (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory (COLT). ACM, New York, pp 144–152

  12. Broder A (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences (SEQUENCES). IEEE Computer Society, Washington, p 21

  13. Bruno N, Chaudhuri S, Gravano L (2002) Top-k selection queries over relational databases: mapping strategies and performance evaluation. ACM Trans Database Syst 27(2): 153–187

    Article  Google Scholar 

  14. Bryan B, Schneider J, Nichol R, Miller C, Genovese C, Wasserman L (2006) Active learning for identifying function threshold boundaries. Advances in neural information processing systems (NIPS). MIT Press

  15. Bueno R, Traina AJM, Traina C Jr (2005) Accelerating approximate similarity queries using genetic algorithms. In: Proceedings of the 2005 ACM symposium on applied computing (SAC), Santa Fe, New Mexico, USA, pp 617–622

  16. Buttler D (2004) A short survey of document structure similarity algorithms. In: Proceedings of the international conference on internet computing, Las Vegas, Nevada, USA, pp 3–9

  17. Carman MJ, Knoblock CA (2005) Inducing source descriptions for automated web service composition. In: Proceedings of the AAAI workshop on exploring planning and scheduling for web services, grid, and autonomic computing, Pittsburgh, Pennsylvania

  18. Carvalho JCP, da Silva AS (2003) Finding similar identities among objects from multiple web sources. In: Proceedings of the 5th ACM international workshop on web information and data management (WIDM). ACM, New York, pp 90–93

  19. Chapman S (2004) SimMetrics: a Java & C # .NET library of similarity metrics. http://sourceforge.net/projects/simmetrics/. Accessed 13 March 2009

  20. Chaudhuri S, Chen B-C, Ganti V, Kaushik R (2007) Example-driven design of efficient record matching queries. In: Proceedings of the 33rd international conference on very large data bases (VLDB), VLDB Endowment, pp 327–338

  21. Chaudhuri S, Ganjam K, Ganti V, Motwani R (2003) Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 29th ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 313–324

  22. Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 5

  23. Chaudhuri S, Ganti V, Motwani R (2005) Robust identification of fuzzy duplicates. In: Proceedings of the 21st international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 865–876

  24. Chaudhuri S, Gravano L (1999) Evaluating top-k selection queries. In: Proceedings of the 25th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 397–410

  25. Chawathe SS, Garcia-Molina H (1997) Meaningful change detection in structured data. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 26–37

  26. Cheng T, Chang KC-C (2007) Entity search engine: towards agile best-effort information integration over the web. In: Third biennial conference on innovative data systems research (CIDR), Asilomar, CA, USA, pp 108–113

  27. Chuan Xiao, Wei Wang XL (2008) Ed-join: an efficient algorithm for similarity joins with edit distance constraints. In: Proceedings of the VLDB Endow. VLDB Endowment, pp 933–944

  28. Cohen WW (1998a) Integration of heterogeneous databases without common domains using queries based on textual similarity, pp 201–212

  29. Cohen WW (1998b) Providing database-like access to the web using queries based on textual similarity. SIGMOD Rec 27(2): 558–560

    Article  Google Scholar 

  30. Cohen WW (1999) Recognizing structure in web pages using similarity queries. In: Proceedings of the sixteenth national conference on artificial intelligence and the eleventh innovative applications of artificial intelligence conference innovative applications of artificial intelligence (AAAI/IAAI). American Association for Artificial Intelligence, Menlo Park, pp 59–66

  31. Cohen WW (2000) Whirl: a word-based information representation language. Artif Intell 118(1–2): 163–196

    Article  MATH  Google Scholar 

  32. Cohen WW, Ravikumar P, Fienberg S (2003a) Secondstring: open source java-based package of approximate string-matching. http://secondstring.sourceforge.net/. Accessed 20 Dec 2005

  33. Cohen WW, Ravikumar P, Fienberg SE (2003b) A comparison of string distance metrics for name-matching tasks. In: Proceedings of workshop on information integration on the Web, IIWeb, pp 73–78

  34. Cohen WW, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, New York, pp 475–480

  35. Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. JASIS 41(6): 391–407

    Article  Google Scholar 

  36. Dey D, Sarkar S (1996) A probabilistic relational model and algebra. ACM Trans Database Syst 21(3): 339–369

    Article  Google Scholar 

  37. Diaconis P, Graham R (1997) Spearman’s footrule as a measure of disarray. J R Stat Soc 39(2): 262–268

    MathSciNet  Google Scholar 

  38. Doan A, Lu Y, Lee Y, Han J (2003) Profile-based object matching for information integration. IEEE Intell Syst 18(5): 54–59

    Article  Google Scholar 

  39. Dorneles CF, Heuser CA, da Silva AS, de Moura ES (2009) A generic strategy for combining similarity metrics in approximate matching. Inf Syst (Oxford) 34: 673–689

    Google Scholar 

  40. Dorneles CF, Heuser CA, Lima AEN, da Silva AS, de Moura ES (2004) Measuring similarity between collection of values. In: Proceedings of the 6th annual ACM international workshop on web information and data management (WIDM). ACM, New York, pp 56–63

  41. Dulucq S, Touzet H (2003) Analysis of tree edit distance algorithms. In: Proceedings of the 14th annual symposium combinatorial pattern matching (CPM), pp 83–95

  42. Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19: 1–16

    Article  Google Scholar 

  43. Flesca S, Manco G, Masciari E, Pontieri L, Pugliese A (2005) Fast detection of XML structural similarity. IEEE Trans Knowl Data Eng 17(2): 160–175

    Article  Google Scholar 

  44. Galhardas H, Florescu D, Shasha D, Simon E (2000) An extensible framework for data cleaning. In: Proceedings of the 16th international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 312

  45. Galhardas H, Florescu D, Shasha D, Simon E, Saita C-A (2001) Declarative data cleaning: language, model, and algorithms. In: Proceedings of the 27th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 371–380

  46. Gao L, Wang M, Wang XS, Padmanabhan S (2004) Expressing and optimizing similarity-based queries in sql. In: 23rd international conference on conceptual modeling (ER), Shanghai, China, pp 464–478

  47. Getoor, L, Taskar, B (eds) (2007) Introduction to statistical relational learning. MIT Press, Cambridge

    MATH  Google Scholar 

  48. Giles CL, Bollacker K, Lawrence S (1998) CiteSeer: an automatic citation indexing system. In: Proceedings of the third ACM conference on digital libraries (DL). ACM, New York, pp 89–98

  49. Gravano L, Ipeirotis PG, Jagadish HV, Koudas N, Muthukrishnan S, Srivastava D (2001) Approximate string joins in a database (almost) for free. In: Proceedings of the 27th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 491–500

  50. Gravano L, Ipeirotis PG, Koudas N, Srivastava D (2003) Text joins in an rdbms for web data integration. In: Proceedings of the 12th international conference on World Wide Web (WWW). ACM, New York, pp 90–101

  51. Guha S, Koudas N, Marathe A, Srivastava D (2004) Merging the results of approximate match operations. In: Proceedings of the thirtieth international conference on very large data bases (VLDB). VLDB Endowment, pp 636–647

  52. Hernandez MA, Stolfo SJ (1995) The merge/purge problem for large databases. SIGMOD Rec 24(2): 127–138

    Article  Google Scholar 

  53. Hernández MA, Stolfo SJ (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Data Min Knowl Discov 2(1): 9–37

    Article  Google Scholar 

  54. Huffman S, Steier D (1995a) Heuristic joins to integrate structured heterogeneous data. In: Working notes of the AAAI spring symposium on information gathering from heterogeneous, distributed environment, pp 74–77

  55. Huffman SB, Steier D (1995b) A navigation assistant for data source selection and integration, Working papers, Price Waterhouse

  56. Ilyas F, Aref G, Elmagarmid K (2004) Supporting top-k join queries in relational databases. VLDB J 13(3): 754–765

    Article  Google Scholar 

  57. Ilyas IF, Aref WG (2005) Rank-aware query processing and optimization. In: Proceedings of the 21st international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 1144

  58. Ilyas IF, Aref WG, Elmagarmid AK (2003) Supporting top-k join queries in relational databases. In: VLDB

  59. Jin L, Li C, Mehrotra S (2002) Efficient similarity string joins in large data sets, technical report, University of California, Irvine. http://www.ics.uci.edu/chenli/pub/strjoin.ps

  60. Kailing K, Kriegel H-P, Schönauer S, Seidl T (2004) Efficient similarity search for hierarchical data in large databases. In: 9th international conference on extending database technology (EDBT), Heraklion, Crete, Greece, pp 676–693

  61. Knoblock CA, Ambite JL, Thakkar S (2003) A view integration approach to dynamic composition of web services. In: Proceedings of ICAPS workshop on planning for web services

  62. Koudas N, Marathe A, Srivastava D (2004) Flexible string matching against large databases in practice. In: Proceedings of the thirtieth international conference on very large data bases (VLDB), VLDB Endowment, pp 1078–1086

  63. Laender AHF, Gonalves MA, Roberto PA (2004) Bdbcomp: building a digital library for the brazilian computer science community. In: ACM/IEEE joint conference on digital libraries (JCDL), Tuscon, AZ, USA, pp 23–24

  64. Lee L (2001) On the effectiveness of the skew divergence for statistical language analysis. Artif Intell Stat, pp 65–72

  65. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady, vol 10

  66. Luis Leit a, Calado P, Weis M (2007) Structure-based inference of xml similarity for fuzzy duplicate detection. In: Proceedings of the sixteenth ACM conference on information and knowledge management (CIKM). ACM, New York, pp 293–302

  67. Ma Y, Chbeir R (2005) Content and structure based approach for xml similarity. In: Proceedings of the fifth international conference on computer and information technology (CIT). IEEE Computer Society, Washington, pp 136–140

  68. Melnik S, Garcia-Molina H, Rahm E (2002) Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: Proceedings of the 18th international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 117–128

  69. Michalowski M, Ambite JL, Knoblock CA, Minton S, Thakkar S, Tuchinda R (2004) Retrieving and semantically integrating heterogeneous data from the web. IEEE Intell Syst 19(3)

  70. Michalowski M, Thakkar S, Knoblock CA (2005) Automatically utilizing secondary sources to align information across data sources. AI Mag 26(1): 72–79

    Google Scholar 

  71. Milano D, Scannapieco M, Catarci T (2006) Structure aware xml object identification. In: Proceedings of the first Int’l VLDB workshop on clean databases (CleanDB). Seoul, Korea

  72. Mitchell TM (1997) Machine learning. McGraw-Hill, New York

    MATH  Google Scholar 

  73. Motro A (1988) Vague: a user interface to relational databases that permits vague queries. ACM Trans Off Inf Syst 6(3): 187–214

    Article  Google Scholar 

  74. Navarro G (2001) A guided tour of approximate string matching. ACM Comput Surv 33(1): 31–88

    Article  Google Scholar 

  75. Nayak R (2008) Fast and effective clustering of xml data using structural information. Knowl Inf Syst 14(2): 197–215

    Article  MathSciNet  Google Scholar 

  76. Nierman A, Jagadish HV (2002) Evaluating structural similarity in XML documents. In: 5th international workshop on the web and databases (WebDB), Madison, Wisconsin, USA, pp 61–66

  77. Park U, Seo Y (2005) An implementation of XML documents search system based on similarity in structure and semantics. In: International workshop on challenges in web information retrieval and integration

  78. Puhlmann S, Weis M, Naumann F (2006) XML duplicate detection using sorted neighborhoods. In: Proceedings of the 10th international conference on extending database technology (EDBT), pp 773–791

  79. Ragnemalm I (1993) The euclidean distance transform, Ph.D thesis, Department of Electrical Engineering, Linkpping University

  80. Salton G, McGill M (1984) Introduction to modern information retrieval. McGraw-Hill, New York

    Google Scholar 

  81. Schallehn E, Sattler K-U, Saake G (2001) Advanced grouping and aggregation for data integration. In: CIKM ’01 proceedings of the tenth international conference on information and knowledge management. ACM, New York, pp 547–549

  82. Schallehn E, Sattler K-U, Saake G (2004) Efficient similarity-based operations for data integration. Data Knowl Eng 48(3): 361–387

    Article  Google Scholar 

  83. Shatkay H, Zdonik SB (1996) Approximate queries and representations for large data sequences. In: Proceedings of the twelfth international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 536–545

  84. Shen W, Li X, Doan A (2005) Constraint-based entity matching. In: Proceedings of the 20th national conference on artificial intelligence (AAAI). AAAI Press, pp 862–867

  85. Stasiu RK, Heuser CA, Silva R (2005) Estimating recall and precision for vague queries in databases. In: Proceedings of the 17th international conference advanced information systems engineering (CAISE), pp 187–200

  86. Takasu A, Fukagawa D, Akutsu T (2007) Statistical learning algorithm for tree similarity. In: Proceedings of the seventh IEEE international conference on data mining (ICDM). IEEE Computer Society, Washington, pp 667–672

  87. Tejada S, Knoblock CA, Minton S (2001) Learning object identification rules for information integration. Inf Syst 26(8)

  88. Tejada S, Knoblock CA, Minton S (2002) Learning domain-independent string transformation weights for high accuracy object identification, pp 350–359

  89. Thor A, Rahm E (2007) Moma—a mapping-based object matching system. In: ‘Third biennial conference on innovative data systems research (CIDR). Asilomar, CA, USA, pp 247–258

  90. Tran T, Nayak R, Bruza P (2008) Combining structure and content similarities for xml document clustering. In: Proceedings of the 7th Australasian data mining conference (AusDM), vol 87. ACS, Glenelg, pp 219–226

  91. Vinson AR, Heuser CA, Silva AS, de Moura ES (2007) An approach to xml path matching. In: Proceedings of the 9th annual ACM international workshop on web information and data management (WIDM). ACM, New York, pp 17–24

  92. Wan X (2008) Beyond topical similarity: a structural similarity measure for retrieving highly similar documents. Knowl Inf Syst 15(1): 55–73

    Article  Google Scholar 

  93. Wang T-Y, Ré C, Suciu D (2008) Implementing not exists predicates over a probabilistic database. In: Proceedings of the international workshop on quality in databases and management of uncertain data, pp 73–86

  94. Wang W, Xiao C, Lin X, Zhang C (2009) Efficient approximate entity extraction with edit distance constraints. In: Proceedings of the 35th SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 759–770

  95. Weis M, Naumann F (2004) Detecting duplicate objects in xml documents. In: Proceedings of the 2004 international workshop on information quality in information systems (IQIS). ACM, New York, pp 10–19

  96. Weis M, Naumann F (2005) Dogmatix tracks down duplicates in xml. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 431–442

  97. Weis M, Naumann F (2006) Detecting duplicates in complex xml data. In: Proceedings of the 22nd international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 109

  98. Widom J (2008) Managing and mining uncertain data. In: Trio: a system for data, uncertainty, and lineage. Springer, Berlin

  99. Xiao C, Wang W, Lin X, Shang H (2009) Top-k set similarity joins. In: Proceedings of the IEEE international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 916–927

  100. Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: Proceeding of the 17th international conference on World Wide Web (WWW). ACM, New York, pp 131–140

  101. Yang J, Cheung WK, Chen X (2005) Integrating element and term semantics for similarity-based xml document clustering. In: Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence (WI). IEEE Computer Society, Washington, pp 222–228

  102. Yongming G, Dehua C, Jiajin L (2008) Clustering xml documents by combining content and structure. In: Proceedings of the international symposium on information science and engieering. IEEE Computer Society, Los Alamitos, pp 583–587

  103. Yu CT, Philip G, Meng W (2003) Distributed top-n query processing with possibly uncooperative local systems. In: Proceedings of the 29th international conference on very large data bases (VLDB). VLDB Endowment, pp 117–128

  104. Zhao H (2008) Instance weighting versus threshold adjusting for cost-sensitive classification. Knowl Inf Syst 15(3): 321–334

    Article  Google Scholar 

  105. Zhao H, Ram S (2005) Entity identification for heterogeneous database integration: a multiple classifier system approach and empirical evaluation. Inf Syst 30(2): 119–132

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Carina Friedrich Dorneles.

Additional information

This work was partially done while the author C. F. Dorneles was a Ph.D. student at the Informatic Institute, UFRGS, and was supported by Capes.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Dorneles, C.F., Gonçalves, R. & dos Santos Mello, R. Approximate data instance matching: a survey. Knowl Inf Syst 27, 1–21 (2011). https://doi.org/10.1007/s10115-010-0285-0

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-010-0285-0

Keywords

Navigation