XML Duplicate Detection Using Sorted Neighborhoods

Puhlmann, Sven; Weis, Melanie; Naumann, Felix

doi:10.1007/11687238_46

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3896)

Abstract

Detecting duplicates is a problem with a long tradition in many domains, such as customer relationship management and data warehousing. The problem is twofold: first, define a suitable similarity measure; second, apply that measure efficiently to all pairs of objects. With the advent and spread of the XML data model, it is necessary to find new similarity measures and to develop efficient methods to detect duplicate elements in nested XML data.

A classical approach to duplicate detection in flat relational data is the sorted neighborhood method, which draws its efficiency from sliding a window over the sorted relation and comparing only tuples within that window. We extend the algorithm to cover not only a single relation but also nested XML elements. To compare objects, we make use of XML parent and child relationships. For efficiency, we apply the windowing technique in a bottom-up fashion, detecting duplicates at each level of the XML hierarchy. Experiments show a speedup comparable to that of the original method, and they show the high effectiveness of our algorithm in detecting XML duplicates.
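
To make the windowing idea concrete, the following is a minimal sketch of the classical sorted neighborhood method on flat records only. It is not the authors' implementation, and it does not show the bottom-up XML extension. The function names, the sort key (first four characters), the use of `difflib.SequenceMatcher` as the similarity measure, and the window size and threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Illustrative string similarity in [0, 1]; a stand-in for a real measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Return index pairs (i, j), i < j, judged duplicates inside a sliding window."""
    # Step 1: sort record indices by the generated key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    duplicates = set()
    # Step 2: slide the window over the sorted order, comparing each record
    # only with the (window - 1) records that precede it.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                duplicates.add((min(i, j), max(i, j)))
    return duplicates


# Hypothetical sample data: two pairs of near-identical strings.
records = [
    "The Matrix 1999",
    "Blade Runner 1982",
    "The Matrx 1999",
    "Alien 1979",
    "Blade Runer 1982",
]
pairs = sorted_neighborhood(records, key=lambda r: r[:4].lower(), window=2)
print(pairs)
```

With n records and window size w, this sketch performs at most n(w - 1) comparisons instead of the n(n - 1)/2 needed for all pairs, at the cost of missing duplicates whose keys sort far apart.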

Author information

Authors and Affiliations

Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Sven Puhlmann, Melanie Weis & Felix Naumann

Editor information

Editors and Affiliations

University of Athens, Greece
Yannis Ioannidis
University of Konstanz, P.O.Box D188, 78457, Konstanz, Germany
Marc H. Scholl
Sustainable Content Logistics Centre, Hamburg, Germany
Joachim W. Schmidt
Chair of Software Engineering for Business Information Systems, Technische Universität München, Boltzmannstraße 3, 85748, Garching b. München,
Florian Matthes
Department of Informatics, University of Athens Panepistimiopolis, 15771, Athens, Greece
Mike Hatzopoulos
IPD, Universität Karlsruhe, Am Fasanengarten 5, 76131, Karlsruhe,
Klemens Boehm
TU München, D-85748, Garching, Germany
Alfons Kemper
Technische Universität München, Germany
Torsten Grust
Institute for Computer Science, Ludwig-Maximilians Universität München,
Christian Boehm

About this paper

Cite this paper

Puhlmann, S., Weis, M., Naumann, F. (2006). XML Duplicate Detection Using Sorted Neighborhoods. In: Ioannidis, Y., et al. Advances in Database Technology - EDBT 2006. EDBT 2006. Lecture Notes in Computer Science, vol 3896. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11687238_46

DOI: https://doi.org/10.1007/11687238_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32960-2
Online ISBN: 978-3-540-32961-9
eBook Packages: Computer Science, Computer Science (R0)
