ABSTRACT
Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework MGS for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, targeting at synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains and the results show good accuracy.
- C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4):323--364, 1986. Google ScholarDigital Library
- M. K. Bergman. The deep web: Surfacing hidden value. Technical report, BrightPlanet LLC, Dec. 2000.Google Scholar
- P. Bickel and K. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. Prentice Hall, 2001.Google Scholar
- K. C.-C. Chang, B. He, C. Li, and Z. Zhang. Structured databases on the web: Observations and implications. Report UIUCDCS-R-2003-2321, Dept. of Computer Science, UIUC, Feb. 2003.Google Scholar
- W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD 1998. Google ScholarDigital Library
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (Section Edition). MIT Press, 2001. Google ScholarDigital Library
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39:1--38, 1977.Google ScholarCross Ref
- A. Doan, P. Domingos, and A. Y. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In SIGMOD 2001. Google ScholarDigital Library
- A. Halevy, O. Etzioni, A. Doan, Z. Ives, J. Madhavan, L. McDowell, and I. Tatarinov. Crossing the structure chasm. Conf. on Innovative Database Research, 2003.Google Scholar
- B. He, T. Tao, C. Li, and K. C.-C. Chang. Clustering structured web sources: A schema-based, model-differentiation approach. Report UIUCDCS-R-2003-2322, Dept. of Computer Science, UIUC, Feb. 2003.Google Scholar
- J. Larson, S. Navathe, and R. Elmasri. A theory of attributed equivalence in databases with application to schema integration. IEEE Trans. on Software Engr., 16(4):449--463, 1989. Google ScholarDigital Library
- C. J. Lloyd. Statistical Analysis of Categorical Data. Wiley, 1999.Google Scholar
- J. Madhavan, P. A. Bernstein, and E. Rahm. Generic schema matching with cupid. In VLDB 2001. Google ScholarDigital Library
- S. Navathe and S. Gadgil. A methodology for view integration in logical data base design. In VLDB 1982. Google ScholarDigital Library
- J. Ponte and W. Croft. A language modelling approach to information retrieval. In SIGIR 1998. Google ScholarDigital Library
- E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334--350, 2001. Google ScholarDigital Library
- L. Seligman, A. Rosenthal, P. Lehner, and A. Smith. Data integration: Where does the time go? Bulletin of the Tech. Committee on Data Engr., 25(3), 2002.Google Scholar
Index Terms
- Statistical schema matching across web query interfaces
Recommendations
Schema Matching across Query Interfaces on the Deep Web
BNCOD '08: Proceedings of the 25th British national conference on Databases: Sharing Data, Information and KnowledgeSchema matching is a crucial step in data integration. Many approaches to schema matching have been proposed so far. Different types of information about schemas, including structures, linguistic features and data types, etc have been used to match ...
Automatic complex schema matching across Web query interfaces: A correlation mining approach
To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing ...
A schema matching-based approach to XML schema clustering
iiWAS '08: Proceedings of the 10th International Conference on Information Integration and Web-based Applications & ServicesThe relationship between XML data clustering and schema matching is bidirectional. On one side, clustering techniques have been adopted to improve matching performance, and on the other side schema matching is the backbone of the clustering technique. ...
Comments