SPHINX: Schema integration by example

Barbançon, Francois; Miranker, Daniel P.

doi:10.1007/s10844-006-0011-2

Published: 02 February 2007

Volume 29, pages 145–184 (2007)

Journal of Intelligent Information Systems

Francois Barbançon¹ &
Daniel P. Miranker¹

Abstract

The Internet has created a critical need for automated tools that facilitate the integration of countless databases. Since nontechnical end users are often the ultimate repositories of the domain information required to distinguish differences in data types, an effective solution must integrate simple GUI-based data browsing tools and automatic mapping methods that eliminate the requirement for a technical user to supervise the process. We develop a metamodel of data integration as the basis for absorbing feedback from an end user. The schema integration algorithm draws examples from the data and learns integrating view definitions by asking a user simple yes-or-no questions. The metamodel enables a search mechanism that is guaranteed to converge to a correct integrating view definition without the user having to know a view definition language such as SQL or SchemaSQL, or even having to inspect the final view definition. We show how data catalog statistics, normally used to optimize queries, can be exploited to parameterize the search heuristics and improve the convergence of the learning algorithm.
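The idea of learning a view definition from yes-or-no answers can be illustrated with a toy sketch. This is not the SPHINX algorithm or its metamodel: it assumes a deliberately tiny hypothesis space (one-to-one column renamings only), and the names `candidate_mappings`, `apply_mapping`, `learn_mapping`, and `oracle`, as well as the "most even split" question heuristic, are hypothetical choices made for illustration. The sketch keeps the set of candidate mappings consistent with all answers so far and shrinks it with each question, in the spirit of version-space candidate elimination.

```python
from itertools import permutations


def candidate_mappings(source_cols, target_cols):
    """Enumerate every one-to-one assignment of source columns to target columns.

    Assumption: the hypothesis space is plain column renaming, far simpler
    than the view definitions discussed in the abstract.
    """
    return [dict(zip(target_cols, perm))
            for perm in permutations(source_cols, len(target_cols))]


def apply_mapping(mapping, row):
    """Project one source row into the target schema under a candidate mapping."""
    return tuple(sorted((t, row[s]) for t, s in mapping.items()))


def learn_mapping(rows, source_cols, target_cols, oracle):
    """Narrow the candidate set by asking yes-or-no questions about examples.

    `oracle(row, proposed)` stands in for the end user: it returns True when
    `proposed` is the correct integrated form of `row`.
    """
    candidates = candidate_mappings(source_cols, target_cols)
    questions = 0
    while len(candidates) > 1:
        best = None
        for row in rows:
            # Group the surviving candidates by the output they predict for this row.
            groups = {}
            for m in candidates:
                groups.setdefault(apply_mapping(m, row), []).append(m)
            for out, group in groups.items():
                if len(group) == len(candidates):
                    continue  # every candidate agrees, so the answer tells us nothing
                # Hypothetical heuristic: prefer the question that splits most evenly.
                balance = abs(len(candidates) - 2 * len(group))
                if best is None or balance < best[0]:
                    best = (balance, row, out, group)
        if best is None:
            break  # the remaining candidates cannot be told apart on this data
        _, row, out, group = best
        questions += 1
        if oracle(row, dict(out)):
            candidates = group
        else:
            candidates = [m for m in candidates if m not in group]
    return candidates, questions


# Usage with made-up data; the "user" is simulated from a known true mapping.
rows = [
    {"name": "Ada", "city": "Austin", "phone": "555-0100"},
    {"name": "Bob", "city": "Boston", "phone": "555-0101"},
]
true_mapping = {"person": "name", "location": "city"}


def oracle(row, proposed):
    return proposed == {t: row[s] for t, s in true_mapping.items()}


remaining, questions = learn_mapping(
    rows, ["name", "city", "phone"], ["person", "location"], oracle)
print(remaining, questions)
```

Every question removes at least one candidate and never removes the mapping the simulated user has in mind, so the loop terminates with that mapping still in the surviving set. The abstract's point about catalog statistics would correspond here to choosing questions with something better informed than the simple balance score.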

Author information

Authors and Affiliations

Department of Computer Sciences, University of Texas, 1 University Station C0500, Austin, TX, 78712-0233, USA
Francois Barbançon & Daniel P. Miranker

Corresponding author

Correspondence to Francois Barbançon.

About this article

Cite this article

Barbançon, F., Miranker, D.P. SPHINX: Schema integration by example. J Intell Inf Syst 29, 145–184 (2007). https://doi.org/10.1007/s10844-006-0011-2

Received: 03 December 2003
Revised: 07 July 2005
Accepted: 01 November 2005
Published: 02 February 2007
Issue Date: October 2007
DOI: https://doi.org/10.1007/s10844-006-0011-2
