
An analysis of duplicate on web extracted objects

Author:
Stefano Ortona

University of Oxford, Oxford, United Kingdom


WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide Web, April 2014, Pages 1279–1284, https://doi.org/10.1145/2567948.2579708

Published: 07 April 2014


ABSTRACT

Today the web has become the largest available source of information. The automatic extraction of structured data from the web is a challenging problem that has been widely investigated. However, after extraction, duplicates among the extracted web records must be identified in order to present clean data to the end user. This problem, also known as record linkage or record matching, has been of central interest to the database community; however, only a few works have addressed it in the web context. In this paper we present web object matching, the problem of identifying duplicates among records extracted from the web.

We show that the web scenario combines all the problems of a classic record linkage setting with the uncertainty introduced by the web. Indeed, the records are the output of an extraction system which, unlike conventional databases or APIs, introduces semantic errors that are not due to a problem in the source. Most previous approaches assume that the records to match contain correct information, which can then be used to identify duplicates. In this work we give an overview of an approach that performs a validation step before the actual identification of duplicates, in order to check whether the information in a record can be trusted. The approach works without any human supervision or training data, and it deals with the problem not only record by record (as other approaches do) but also source by source, which allows it to detect and possibly correct systematic errors for an entire source. The only human effort required is the specification of a small amount of knowledge about the domain of interest, through a set of ontology constraints and an entity extraction system.
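To make the validate-then-match idea concrete, the following is a minimal Python sketch and not the method of the paper, whose details the abstract does not give. Everything in it is an assumption made for illustration: the two regular-expression constraints stand in for ontology constraints, the record layout (a dict with a `source` key), the 0.8 threshold, and the function names `invalid_attributes`, `systematic_errors`, `trusted_values` and `is_duplicate` are all hypothetical. It flags attributes that violate a constraint in most records of a source, and compares records only on the attributes that remain trusted.

```python
import re
from collections import defaultdict

# Hypothetical domain constraints (stand-ins for ontology constraints):
# attribute name -> predicate on the raw extracted string.
CONSTRAINTS = {
    "price": lambda v: re.fullmatch(r"\d+(\.\d+)?", v) is not None,
    "postcode": lambda v: re.fullmatch(
        r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}", v) is not None,
}


def invalid_attributes(record):
    """Record-by-record validation: attributes that violate a constraint."""
    return {a for a, ok in CONSTRAINTS.items()
            if a in record and not ok(record[a])}


def systematic_errors(records, threshold=0.8):
    """Source-by-source validation: attributes that fail in at least
    `threshold` of the records of a source (assumed cut-off)."""
    totals = defaultdict(int)
    failures = defaultdict(lambda: defaultdict(int))
    for r in records:
        totals[r["source"]] += 1
        for a in invalid_attributes(r):
            failures[r["source"]][a] += 1
    return {s: {a for a, n in fails.items() if n / totals[s] >= threshold}
            for s, fails in failures.items()}


def trusted_values(record, untrusted):
    """Normalised values of the attributes that passed both validations."""
    bad = invalid_attributes(record) | untrusted.get(record["source"], set())
    return {a: v.strip().lower() for a, v in record.items()
            if a != "source" and a not in bad}


def is_duplicate(r1, r2, untrusted):
    """Naive matcher: equal on every trusted attribute both records share."""
    v1, v2 = trusted_values(r1, untrusted), trusted_values(r2, untrusted)
    shared = v1.keys() & v2.keys()
    return bool(shared) and all(v1[a] == v2[a] for a in shared)


# Invented records: the wrapper for b.example swaps price and postcode.
records = [
    {"source": "a.example", "name": "Flat 1", "price": "1200", "postcode": "OX1 3QD"},
    {"source": "b.example", "name": "flat 1", "price": "OX1 3QD", "postcode": "1200"},
    {"source": "b.example", "name": "Flat 7", "price": "OX2 6GG", "postcode": "950"},
]
untrusted = systematic_errors(records)
print(untrusted)
print(is_duplicate(records[0], records[1], untrusted))
```

In this toy data a matcher that trusted every attribute would reject the first two records because their price and postcode values disagree; once the swapped attributes of `b.example` are flagged as untrusted, the comparison falls back to the name alone. A real system would replace exact equality with a similarity measure and could attempt to repair the swapped attributes instead of discarding them.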


Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining
  2. World Wide Web
    1. Web applications
    2. Web services


Published in
WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide Web
April 2014
1396 pages
ISBN:9781450327459
DOI:10.1145/2567948
General Chair:
Chin-Wan Chung (Korea Advanced Institute of Science and Technology, Korea)
Program Chairs:
Andrei Broder (Google Inc., USA)
Kyuseok Shim (Seoul National University, Korea)
Torsten Suel (New York University, USA)
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data cleaning
data extraction
deduplication
matching
record linkage
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%