ABSTRACT
The web has become the largest available source of information. The automatic extraction of structured data from the web is a challenging problem that has been widely investigated. After extraction, however, duplicates among the extracted web records must be identified in order to present clean data to the end user. This problem, also known as record linkage or record matching, has been of central interest to the database community; however, only a few works have addressed it in the web context. In this paper we present web object matching, the problem of identifying duplicates among records extracted from the web.
We show that the web scenario poses all the problems of a classic record linkage setting plus the uncertainty introduced by the web. Indeed, the records are the output of an extraction system that, unlike conventional databases or APIs, introduces semantic errors that are not due to a problem in the source. Most previous approaches assume that the records to match contain correct information that can be used to identify duplicates. In this work we give an overview of an approach that performs a validation step before the actual identification of duplicates, in order to check whether the information in a record can be trusted. The approach works without any human supervision or training data, and it deals with the problem not only record by record (as other approaches do) but also source by source, which allows detecting and possibly correcting systematic errors for an entire source. The only human effort required is the specification of a small amount of knowledge about the domain of interest, through a set of ontology constraints and an entity extraction system.
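To make the validate-then-match idea concrete, the following is a minimal sketch and not the system described in the paper. It assumes a hypothetical real-estate domain in which the "ontology constraints" are reduced to simple regular-expression checks; the attribute names, the `CONSTRAINTS` table, the 0.5 threshold, and all function names are illustrative assumptions.

```python
import re
from collections import defaultdict

# Hypothetical domain constraints (an assumption, not taken from the paper):
# each attribute maps to a predicate that accepts well-formed values.
CONSTRAINTS = {
    "price": lambda v: re.fullmatch(r"\d+(\.\d{1,2})?", v) is not None,
    "postcode": lambda v: re.fullmatch(
        r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}", v) is not None,
}


def validate(record):
    """Record-level check: return the attributes whose values violate a constraint."""
    return {a for a, ok in CONSTRAINTS.items()
            if a in record and not ok(record[a])}


def suspicious_sources(records, threshold=0.5):
    """Source-level check: return (source, attribute) pairs that fail validation
    in more than `threshold` of that source's records, which hints at a
    systematic extraction error rather than an isolated one."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in records:
        totals[r["source"]] += 1
        for a in validate(r):
            failures[(r["source"], a)] += 1
    return {(s, a) for (s, a), n in failures.items() if n / totals[s] > threshold}


def is_duplicate(r1, r2, keys=("price", "postcode")):
    """Compare two records only on attributes that passed validation in both."""
    untrusted = validate(r1) | validate(r2)
    trusted = [k for k in keys if k in r1 and k in r2 and k not in untrusted]
    # Exact equality stands in for a real similarity measure.
    return bool(trusted) and all(r1[k] == r2[k] for k in trusted)


# Illustrative records; the second source mislabels free text as a postcode.
records = [
    {"source": "a.example", "price": "250000", "postcode": "OX1 3QD"},
    {"source": "b.example", "price": "250000", "postcode": "3 bedrooms"},
    {"source": "b.example", "price": "180000", "postcode": "garden flat"},
]
```

In this sketch the matcher ignores the untrusted postcode values instead of treating them as evidence of a mismatch, and the source-level pass flags the postcode attribute of `b.example` as a candidate for correction across the whole source.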