DOI: 10.1145/2567948.2579708
research-article

An analysis of duplicate on web extracted objects

Published: 07 April 2014

ABSTRACT

Today the web is the largest available source of information. The automatic extraction of structured data from the web is a challenging problem that has been widely investigated. After extraction, however, duplicates among the extracted web records must be identified in order to present clean data to the end user. This problem, also known as record linkage or record matching, has been of central interest to the database community, yet only a few works have addressed it in the web context. In this paper we present web object matching, the problem of identifying duplicates among records extracted from the web.

We show that the web scenario involves all the problems of a classic record linkage setting plus the uncertainty introduced by the web. The records are the output of an extraction system which, unlike conventional databases or APIs, introduces semantic errors that are not due to a problem in the source. Most previous approaches assume that the records to match contain correct information that can be used to identify duplicates. In this work we give an overview of an approach that performs a validation step before the actual identification of duplicates, in order to check whether the information in a record can be trusted. The approach works without any human supervision or training data and deals with the problem not only record by record (as other approaches do) but also source by source, which allows detecting and possibly correcting systematic errors for an entire source. The only human effort required is to provide a small amount of knowledge about the domain of interest, through a set of ontology constraints and an entity extraction system.
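The validate-then-match idea can be illustrated with a minimal Python sketch. It is not the authors' system: the two constraints, the 0.8 and 0.9 thresholds, the string-similarity measure, and all function names are assumptions chosen for illustration, and simple predicates stand in for the ontology constraints and the entity extraction system mentioned in the abstract. Values that break a constraint are ignored when matching, and an attribute that fails in most records of one source is treated as a systematic error and distrusted for the whole source.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical domain constraints: attribute name -> predicate its value must satisfy.
CONSTRAINTS = {
    "price": lambda v: v.replace(".", "", 1).isdigit(),
    "postcode": lambda v: len(v) >= 5 and any(c.isdigit() for c in v),
}

def violations(record):
    """Record-by-record check: attributes whose values break a constraint."""
    return [a for a, ok in CONSTRAINTS.items() if a in record and not ok(record[a])]

def systematic_errors(records, threshold=0.8):
    """Source-by-source check: (source, attribute) pairs failing in most records."""
    total, failed = defaultdict(int), defaultdict(int)
    for r in records:
        bad = set(violations(r))
        for a in CONSTRAINTS:
            if a in r:
                total[(r["source"], a)] += 1
                if a in bad:
                    failed[(r["source"], a)] += 1
    return {k for k in total if failed[k] / total[k] >= threshold}

def trusted_view(record, untrusted):
    """Keep only attributes that passed validation for this record and its source."""
    bad = set(violations(record))
    return {a: v for a, v in record.items()
            if a != "source" and a not in bad
            and (record["source"], a) not in untrusted}

def similarity(a, b):
    """Average string similarity over the attributes both views share."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(SequenceMatcher(None, a[k].lower(), b[k].lower()).ratio()
               for k in shared) / len(shared)

def find_duplicates(records, min_sim=0.9):
    """Validate first, then compare every pair using trusted attributes only."""
    untrusted = systematic_errors(records)
    views = [trusted_view(r, untrusted) for r in records]
    return [(i, j) for i in range(len(views)) for j in range(i + 1, len(views))
            if similarity(views[i], views[j]) >= min_sim]
```

In this sketch, a source whose wrapper consistently puts street names into the `price` field has that field excluded from comparison, so its records can still be matched on the remaining attributes instead of being penalised for the extraction error.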


Published in

WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide Web
April 2014
1396 pages
ISBN: 9781450327459
DOI: 10.1145/2567948

Copyright © 2014 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions, 23%
