DOI: 10.1145/1007568.1007582

An interactive clustering-based approach to integrating source query interfaces on the deep Web

Published: 13 June 2004

Abstract

A growing number of data sources are becoming available on the Web, but their contents are often accessible only through query interfaces. For a given domain of interest, there often exist many such sources with varied coverage or querying capabilities. As an important step toward integrating these sources, we consider the integration of their query interfaces. More specifically, we focus on the crucial step of the integration: accurately matching the interfaces. While the integration of query interfaces has received more attention recently, current approaches are not sufficiently general: (a) they all model interfaces with flat schemas; (b) most of them consider only 1:1 mappings of fields across the interfaces; (c) they all perform the integration in a black-box fashion, so the whole process has to be restarted from scratch if anything goes wrong; and (d) they often require laborious parameter tuning. In this paper, we propose an interactive, clustering-based approach to matching query interfaces. The hierarchical nature of interfaces is captured with ordered trees. Varied types of complex mappings of fields are examined, and several approaches are proposed to identify these mappings effectively. We put the human integrator back in the loop and propose several novel approaches to the interactive learning of parameters and the resolution of uncertain mappings. Extensive experiments are conducted, and the results show that our approach is highly effective.
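
The abstract names two ingredients, ordered trees for interfaces and clustering for matching fields, without giving details. The sketch below is only an illustration of that general idea and is not the paper's algorithm: the `Node` class, the token-level Dice similarity, the single-link merge rule, the 0.5 threshold, and the two toy flight-search interfaces are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """Ordered-tree node: internal nodes group fields, leaves are fields."""
    label: str
    children: list[Node] = field(default_factory=list)

    def leaves(self):
        """Yield leaf fields in left-to-right (on-screen) order."""
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


def dice(a: str, b: str) -> float:
    """Dice coefficient over the lower-cased word sets of two labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def cluster_fields(interfaces, threshold=0.5):
    """Greedy single-link clustering of leaf fields across interfaces.

    Each cluster is a list of (interface_index, field_label) pairs.
    The most similar pair of clusters is merged repeatedly until no
    pair reaches the (assumed) similarity threshold.
    """
    clusters = [
        [(i, leaf.label)]
        for i, tree in enumerate(interfaces)
        for leaf in tree.leaves()
    ]
    while True:
        best_sim, best_pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            sim = max(dice(a, b) for _, a in clusters[x] for _, b in clusters[y])
            if sim >= threshold and sim > best_sim:
                best_sim, best_pair = sim, (x, y)
        if best_pair is None:
            return clusters
        x, y = best_pair          # x < y, so popping y leaves index x valid
        clusters[x].extend(clusters.pop(y))


# Two hypothetical flight-search interfaces.
a = Node("flights", [
    Node("passengers", [Node("adults"), Node("children")]),
    Node("departure city"),
])
b = Node("flights", [Node("from city"), Node("number of passengers")])

clusters = cluster_fields([a, b], threshold=0.5)
for c in clusters:
    print(c)
```

In this toy run, "departure city" and "from city" end up in one cluster, while "adults" and "children" are never grouped with "number of passengers". Label similarity alone misses that kind of one-to-many correspondence, which is the sort of complex mapping the abstract says needs dedicated handling, and a fixed threshold is exactly the sort of parameter it proposes to learn interactively.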

Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
June 2004
988 pages
ISBN:1581138598
DOI:10.1145/1007568
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article

Conference

SIGMOD/PODS04
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
