Abstract
The Internet has created a critical need for automated tools that facilitate the integration of countless databases. Since nontechnical end users are often the ultimate repositories of the domain information required to distinguish differences in data types, an effective solution must integrate simple GUI-based data browsing tools and automatic mapping methods that eliminate the requirement for a technical user to supervise the process. We develop a metamodel of data integration as the basis for absorbing feedback from an end user. The schema integration algorithm draws examples from the data and learns integrating view definitions by asking a user simple yes-or-no questions. The metamodel enables a search mechanism that is guaranteed to converge to a correct integrating view definition without the user having to know a view definition language such as SQL or SchemaSQL, or even having to inspect the final view definition. We show how data catalog statistics, normally used to optimize queries, can be exploited to parameterize the search heuristics and improve the convergence of the learning algorithm.
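The idea of learning a view definition from yes-or-no answers can be illustrated with a small sketch. This is not the SPHINX algorithm or its metamodel; it is a generic candidate-elimination loop written under simplifying assumptions: each candidate view is just an ordered choice of source columns, and the user is simulated by a hidden target mapping. The names learn_mapping, simulated_user, and the sample rows are hypothetical. The question-selection rule (prefer the question that splits the remaining candidates most evenly) stands in for the statistics-driven heuristics mentioned above and does not use catalog statistics.

```python
from collections import Counter
from itertools import permutations


def project(row, candidate):
    """Apply a candidate view (an ordered tuple of column names) to one row."""
    return tuple(row[col] for col in candidate)


def learn_mapping(rows, columns, target_arity, ask):
    """Narrow down candidate views by asking yes/no questions about examples.

    Assumption: the hypothesis space is every ordered selection of
    `target_arity` distinct source columns. `ask(row, output)` returns True
    when `output` is the correct integrated tuple for `row`.
    """
    candidates = list(permutations(columns, target_arity))
    questions = 0
    while len(candidates) > 1:
        best = None
        for row in rows:
            # Count how many surviving candidates produce each output tuple.
            counts = Counter(project(row, cand) for cand in candidates)
            for output, n in counts.items():
                # A question is useful only if some candidates agree with the
                # proposed output and some do not.
                balance = min(n, len(candidates) - n)
                if balance > 0 and (best is None or balance > best[0]):
                    best = (balance, row, output)
        if best is None:
            break  # survivors cannot be told apart on the available data
        _, row, output = best
        answer = ask(row, output)
        questions += 1
        # Keep only the candidates consistent with the user's answer.
        candidates = [c for c in candidates
                      if (project(row, c) == output) == answer]
    return candidates, questions


# Hypothetical source data and a simulated user who has a target view in mind.
rows = [
    {"id": "17", "fullname": "Ada Lovelace", "tel": "555-0100", "city": "London"},
    {"id": "42", "fullname": "Alan Turing", "tel": "555-0199", "city": "Wilmslow"},
]
hidden_target = ("fullname", "tel")


def simulated_user(row, output):
    return project(row, hidden_target) == output


remaining, questions = learn_mapping(
    rows, ["id", "fullname", "tel", "city"], 2, simulated_user
)
print(remaining, questions)
```

Every question asked removes at least one candidate, and a consistent user never causes the correct candidate to be removed, so the loop terminates with the target still in the surviving set. In this toy setting the user only ever confirms or rejects example tuples and never sees a view definition.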
Cite this article
Barbançon, F., Miranker, D.P. SPHINX: Schema integration by example. J Intell Inf Syst 29, 145–184 (2007). https://doi.org/10.1007/s10844-006-0011-2
DOI: https://doi.org/10.1007/s10844-006-0011-2