Article

Statistical schema matching across web query interfaces

Authors:
Bin He, University of Illinois at Urbana-Champaign
Kevin Chen-Chuan Chang, University of Illinois at Urbana-Champaign

SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003, Pages 217–228
https://doi.org/10.1145/872757.872784

Published: 09 June 2003

ABSTRACT

Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework, MGS, for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGS_sd, which targets synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains, and the results show good accuracy.
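
The holistic idea in the abstract, looking at all input schemas at once rather than comparing them pair by pair, can be illustrated with a small Python sketch. This is not Algorithm MGS_sd and does not implement the hypothesis modeling, generation, and selection steps. It substitutes a much simpler heuristic and rests on an assumption that is not stated in the abstract: two attributes that each appear in several schemas of a domain but never appear together in the same schema are plausible synonym candidates. The function name `candidate_synonym_pairs`, the `min_support` parameter, and the sample schemas are all hypothetical.

```python
from itertools import combinations


def candidate_synonym_pairs(schemas, min_support=2):
    """Return pairs of frequent attributes that never co-occur in one schema.

    Simplified heuristic for illustration only (an assumption, not the
    paper's statistical model): frequent attributes that are mutually
    exclusive across schemas are reported as synonym candidates.
    """
    # Count in how many schemas each attribute appears.
    counts = {}
    for schema in schemas:
        for attr in set(schema):
            counts[attr] = counts.get(attr, 0) + 1

    # Keep only attributes seen in at least `min_support` schemas.
    frequent = sorted(a for a, c in counts.items() if c >= min_support)
    frequent_set = set(frequent)

    # Record every pair of frequent attributes seen together in a schema.
    cooccurring = set()
    for schema in schemas:
        present = sorted(set(schema) & frequent_set)
        cooccurring.update(combinations(present, 2))

    # Candidate synonyms: frequent pairs that were never seen together.
    return [p for p in combinations(frequent, 2) if p not in cooccurring]


# Hypothetical query-interface schemas from a "books" domain.
schemas = [
    {"author", "title", "isbn"},
    {"writer", "title", "isbn"},
    {"author", "title"},
    {"writer", "title", "isbn"},
]
print(candidate_synonym_pairs(schemas))
```

On this toy input the only pair that is frequent yet never co-occurs is `author` and `writer`. A real system would need a statistical test to separate genuine synonyms from attributes that fail to co-occur only by chance, which is the role the abstract assigns to hypothesis selection.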

Index Terms

Statistical schema matching across web query interfaces
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Pattern matching

Published in
SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
June 2003
702 pages
ISBN: 158113634X
DOI: 10.1145/872757
Conference Chair: Zachary Ives, University of Pennsylvania
General Chair: Yannis Papakonstantinou, University of California, San Diego
Program Chair: Alon Halevy, University of Washington
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States

Acceptance Rates
SIGMOD '03 Paper Acceptance Rate: 53 of 342 submissions, 15%
Overall Acceptance Rate: 741 of 3,710 submissions, 20%