DOI: 10.1145/872757.872784
Article

Statistical schema matching across web query interfaces

Published: 09 June 2003

ABSTRACT

Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework, MGS, for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, which targets synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains, and the results show good accuracy.
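The second observation in the abstract, that the aggregate schema vocabulary converges as sources are added, can be illustrated with a minimal sketch. This is not the paper's MGS or MGSsd algorithm; it only counts distinct attribute names cumulatively over a list of schemas. The function name `vocabulary_growth` and the toy "books" schemas are hypothetical and are not data from the paper.

```python
from typing import Iterable, List, Set


def vocabulary_growth(schemas: Iterable[Iterable[str]]) -> List[int]:
    """Return the cumulative count of distinct attribute names after each source."""
    seen: Set[str] = set()
    growth: List[int] = []
    for schema in schemas:
        # Normalize names lightly so that "Title" and "title" count once.
        seen.update(name.strip().lower() for name in schema)
        growth.append(len(seen))
    return growth


# Hypothetical query-interface schemas in a "books" domain (illustrative only).
toy_sources = [
    ["author", "title", "isbn"],
    ["title", "author", "subject"],
    ["author", "title", "category"],
    ["title", "isbn", "subject"],
    ["author", "title", "category", "isbn"],
]

growth = vocabulary_growth(toy_sources)
print(growth)
```

On these toy inputs the curve flattens once the few distinct names have all been seen, which is the kind of convergence the abstract describes for real sources.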


      • Published in

        SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
        June 2003
        702 pages
        ISBN: 158113634X
        DOI: 10.1145/872757

        Copyright © 2003 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Qualifiers

        • Article

        Acceptance Rates

        SIGMOD '03 Paper Acceptance Rate: 53 of 342 submissions, 15%
        Overall Acceptance Rate: 741 of 3,710 submissions, 20%
