Abstract
Web-scale data integration relies on fully automated efforts that lack knowledge of the exact match between data descriptions. In this paper, we introduce schema matching prediction, an assessment mechanism that supports schema matchers in the absence of an exact match. Given attribute pair-wise similarity measures, a predictor estimates the success of a matcher in identifying correct correspondences. We present a comprehensive framework in which predictors can be defined, designed, and evaluated. We formally define schema matching evaluation and schema matching prediction using similarity spaces and discuss four desirable properties of predictors: correlation, robustness, tunability, and generalization. We present a method for constructing predictors that supports generalization, and introduce prediction models as a means of tuning prediction toward various quality measures. We define the empirical properties of correlation and robustness and provide concrete measures for their evaluation. We illustrate the usefulness of schema matching prediction with three use cases. First, we propose a method for ranking the relevance of deep Web sources with respect to given user needs. Second, we show how predictors can assist in the design of schema matching systems. Finally, we show how prediction can support dynamic weight setting of matchers in an ensemble, thus improving upon current state-of-the-art weight setting methods. An extensive empirical evaluation shows the value of predictors in these use cases and demonstrates the benefit of prediction models in increasing the performance of schema matching.
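To make the idea of a predictor concrete, the following is a minimal sketch and not the construction used in the paper. It assumes a similarity matrix is a list of rows of attribute pair-wise scores in [0, 1]. The predictor `max_row_predictor`, the threshold-based `precision` function, and the toy matrices are all hypothetical and chosen only for illustration. The sketch computes a predictor value without consulting the exact match, then checks how well that value correlates with a quality measure that does use the exact match.

```python
from statistics import mean


def max_row_predictor(sim):
    """Hypothetical predictor: mean of the best score in each row.

    Uses only the similarity matrix, never the exact match.
    """
    return mean(max(row) for row in sim)


def precision(sim, exact, threshold=0.5):
    """Precision of the pairs selected by thresholding, given the exact match."""
    chosen = {
        (i, j)
        for i, row in enumerate(sim)
        for j, score in enumerate(row)
        if score >= threshold
    }
    if not chosen:
        return 0.0
    return len(chosen & exact) / len(chosen)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy data (assumed): three matchers' outputs for the same 2x2 matching task.
exact_match = {(0, 0), (1, 1)}
matrices = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.5], [0.4, 0.7]],
    [[0.3, 0.6], [0.5, 0.2]],
]

predicted = [max_row_predictor(m) for m in matrices]
actual = [precision(m, exact_match) for m in matrices]
correlation = pearson(predicted, actual)
```

On this toy data the predictor values and the precision values rise and fall together, which is the kind of behavior the correlation property asks of a predictor. Real evaluation would need many matrices and the measures defined in the paper.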
Acknowledgments
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement number 256955.
Cite this article
Sagi, T., Gal, A. Schema matching prediction with applications to data source discovery and dynamic ensembling. The VLDB Journal 22, 689–710 (2013). https://doi.org/10.1007/s00778-013-0325-y
DOI: https://doi.org/10.1007/s00778-013-0325-y