Schema matching prediction with applications to data source discovery and dynamic ensembling

  • Special Issue Paper
  • The VLDB Journal

Abstract

Web-scale data integration involves fully automated efforts which lack knowledge of the exact match between data descriptions. In this paper, we introduce schema matching prediction, an assessment mechanism to support schema matchers in the absence of an exact match. Given attribute pair-wise similarity measures, a predictor predicts the success of a matcher in identifying correct correspondences. We present a comprehensive framework in which predictors can be defined, designed, and evaluated. We formally define schema matching evaluation and schema matching prediction using similarity spaces and discuss a set of four desirable properties of predictors, namely correlation, robustness, tunability, and generalization. We present a method for constructing predictors, supporting generalization, and introduce prediction models as means of tuning prediction toward various quality measures. We define the empirical properties of correlation and robustness and provide concrete measures for their evaluation. We illustrate the usefulness of schema matching prediction by presenting three use cases: We propose a method for ranking the relevance of deep Web sources with respect to given user needs. We show how predictors can assist in the design of schema matching systems. Finally, we show how prediction can support dynamic weight setting of matchers in an ensemble, thus improving upon current state-of-the-art weight setting methods. An extensive empirical evaluation shows the usefulness of predictors in these use cases and demonstrates the usefulness of prediction models in increasing the performance of schema matching.
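As a rough illustration of the idea described in the abstract, the sketch below computes a predictor value from a matcher's attribute pair-wise similarity matrix, measures how well predictor values correlate with a quality measure, and uses predictor values to weight matchers in an ensemble. The specific predictor (the mean of each row's highest similarity), the function names, and the weighting scheme are assumptions made for illustration only; they are not taken from the paper.

```python
from statistics import mean

# Rows: attributes of one schema; columns: attributes of the other.
Matrix = list[list[float]]


def avg_max_predictor(sim: Matrix) -> float:
    """Hypothetical predictor: mean of each row's highest similarity."""
    return mean(max(row) for row in sim)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation, e.g. between predictor values and precision."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant input; 0.0 is a convention here
    return cov / (var_x * var_y) ** 0.5


def dynamic_ensemble(matrices: list[Matrix], predictor) -> tuple[Matrix, list[float]]:
    """Combine matcher outputs, weighting each by its normalized predictor value."""
    scores = [predictor(m) for m in matrices]
    total = sum(scores)
    if total > 0:
        weights = [s / total for s in scores]
    else:
        # Fall back to uniform weights when every predictor value is zero.
        weights = [1 / len(matrices)] * len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    combined = [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(cols)]
        for i in range(rows)
    ]
    return combined, weights


if __name__ == "__main__":
    decisive = [[0.9, 0.1], [0.2, 0.8]]  # toy matcher with clear-cut scores
    vague = [[0.5, 0.5], [0.4, 0.6]]     # toy matcher with ambiguous scores
    combined, weights = dynamic_ensemble([decisive, vague], avg_max_predictor)
    print(weights)
    print(combined)
```

In this toy setting the matcher with the more decisive similarity matrix receives the larger weight without any reference to an exact match, which is the situation the abstract targets. The `pearson` helper would be applied to predictor values and measured quality over a set of schema pairs where an exact match is available for evaluation.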

Notes

  1. The proposed term of a similarity space should not be confused with the one proposed by Zobel and Moffat [50] in the context of document vector spaces.

  2. According to Cohen [7], correlation values over 0.01 represent a small effect, over 0.09 a medium effect and over 0.25 a large effect.

  3. https://bitbucket.org/tomers77/ontobuilder-research-environment/downloads/datasets.zip.

  4. http://www.nisb-project.eu/.

  5. http://oaei.ontologymatching.org/.

References

  1. Batini, C., Lenzerini, M., Navathe, S.B.: A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. (CSUR) 18(4), 323–364 (1986)

  2. Bellahsene, Z.: Schema Matching and Mapping. Springer, New York (2011)

  3. Bergamaschi, S., Castano, S., Vincini, M., Beneventano, D.: Semantic integration of heterogeneous information sources. Data Knowl. Eng. 36(3), 215–249 (2001)

  4. Bernstein, P.A., Melnik, S.: Meta data management. In: ICDE, p. 875. IEEE (2004)

  5. Castano, S., Antonellis, V.D.: Global viewing of heterogeneous data sources. IEEE Trans. Knowl. Data Eng. 13(2), 277–297 (2001)

  6. Cheng, R., Gong, J., Cheung, D.: Managing uncertainty of XML schema matching. In: Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pp. 297–308 (2010)

  7. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum, Hillsdale (1988)

  8. Das Sarma, A., Dong, X., Halevy, A.: Bootstrapping pay-as-you-go data integration systems. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pp. 861–874, New York, NY, USA, ACM (2008)

  9. Do, H.-H., Melnik, S., Rahm, E.: Comparison of schema matching evaluations. In: Chaudhri, A., Jeckle, M., Rahm, E., Unland, R. (eds.) Web, Web-Services, and Database Systems, vol. 2593, LNCS, pp. 221–237. Springer, Berlin (2003)

  10. Do, H.H., Rahm, E.: COMA: a system for flexible combination of schema matching approaches. In: Proceedings of VLDB, pp. 610–621. VLDB Endowment (2002)

  11. Doan, A.H., Domingos, P., Halevy, A.Y.: Reconciling schemas of disparate data sources: A machine-learning approach. In: ACM SIGMOD Record, vol. 30, pp. 509–520. ACM (2001)

  12. Doan, A.H., Madhavan, J., Domingos, P., Halevy, A.: Learning to map between ontologies on the semantic web. In: Proceedings of the 11th International Conference on World Wide Web, pp. 662–673. ACM Press (2002)

  13. Dong, X., Halevy, A., Yu, C.: Data integration with uncertainty. VLDB J. 18, 469–500 (2009)

  14. dos Santos Mello, R., Castano, S., Heuser, C.A.: A method for the unification of xml schemata. Inform. Softw. Technol. 44(4), 241–249 (2002)

  15. Draper, N., Smith, H.: Applied Regression Analysis, 2nd edn. Wiley, New York (1981)

  16. Euzenat, J.: Semantic precision and recall for ontology alignment evaluation. In: Proc. IJCAI, pp. 348–353 (2007)

  17. Franklin, M., Halevy, A., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Rec. 34(4), 27–33 (2005)

  18. Gal, A.: Uncertain schema matching. Synth. Lect. Data Manag. 3(1), 1–97 (2011)

  19. Gal, A., Anaby-Tavor, A., Trombetta, A., Montesi, D.: A framework for modeling and evaluating automatic semantic reconciliation. VLDB J. 14(1), 50–67 (2005)

  20. Gal, A., Modica, G., Jamil, H., Eyal, A.: Automatic ontology matching using application semantics. AI Mag. 26(1), 21 (2005)

  21. Gal, A., Sagi, T.: Tuning the ensemble selection process of schema matchers. Inform. Syst. 35(8), 845–859 (2010)

  22. He, B., Chang, K.C.-C.: Statistical schema matching across web query interfaces. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD ’03, pp. 217–228, New York, NY, USA, ACM (2003)

  23. Kang, J., Naughton, J.F.: On schema matching with opaque column names and data values. In: International Conference on Management of Data: Proceedings of the 2003 ACM SIGMOD International Conference on Management of data, vol. 9, pp. 205–216 (2003)

  24. Lee, M.L., Yang, L.H., Hsu, W., Yang, X.: Xclust: clustering xml schemas for effective integration. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, CIKM ’02, pp. 292–299, New York, NY, USA, ACM (2002)

  25. Lee, Y., Sayyadian, M., Doan, A.H., Rosenthal, A.S.: eTuner: tuning schema matching software using synthetic scenarios. VLDB J. 16(1), 97–122 (2007)

  26. Luenberger, D.: Optimization by Vector Space Methods. Wiley-Interscience, New York (1997)

  27. Madhavan, J., Bernstein, P., Doan, A., Halevy, A.: Corpus-based schema matching. In: Proc. ICDE, pp. 57–68, April (2005)

  28. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with Cupid. In: Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 49–58 (2001)

  29. Madhavan, J., Jeffery, S., Cohen, S., Dong, X., Ko, D., Yu, C., Halevy, A.: Web-scale data integration: You can only afford to pay as you go. In: Proceedings of CIDR, pp. 342–350 (2007)

  30. Magnani, M., Rizopoulos, N., McBrien, P., Montesi, D.: Schema integration based on uncertain semantic mappings. In: Conceptual Modeling ER 2005, pp. 31–46 (2005)

  31. Mao, M., Peng, Y., Spring, M.: A harmony based adaptive ontology mapping approach. In: Proc. of SWWS (2008)

  32. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: ICDE, pp. 117–128. IEEE (2002)

  33. Meo, P.D., Quattrone, G., Terracina, G., Ursino, D.: Integration of xml schemas at various severity levels. Inform. Syst. 31(6), 397–434 (2006)

  34. Miles, J., Shevlin, M.: Applying Regression and Correlation: A Guide for Students and Researchers. Sage, London (2001)

  35. Miller, R.J., Hernandez, M.A., Haas, L.M., Yan, L.-L., Ho, C.T.H., Fagin, R., Popa, L.: The Clio project: managing heterogeneity. SIGMOD Rec. 30(1), 78–83 (2001)

  36. Ngo, D.H., Bellahsene, Z.: Evaluating the Interaction between the different Matchers (or Strategies) in Ontology Matching Task. In: Manfred Hauswirth, J.X.P., Euzenat, J. (eds.) International Semantic Web Conference—ISWC 2012, p. 12, Boston, USA (2012)

  37. Palopoli, L., Terracina, G., Ursino, D.: Experiences using dike, a system for supporting cooperative information system and data warehouse design. Inform. Syst. 28(7), 835–865 (2003)

  38. Peukert, E., Eberius, J., Rahm, E.: AMC-a framework for modelling and comparing matching systems as matching processes. In: ICDE, pp. 1304–1307. IEEE (2011)

  39. Peukert, E., Eberius, J., Rahm, E.: A self-configuring schema matching system. In: ICDE (2012)

  40. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)

  41. Rodriguez-Gianolli, P., Mylopoulos, J.: A semantic approach to xml-based data integration. In: Kunii, H.S., Jajodia, S., Sølvberg, A.S. (eds.) Conceptual Modeling ER 2001, vol. 2224. Lecture Notes in Computer Science, pp. 117–132. Springer, Berlin (2001)

  42. Sagi, T., Gal, A.: Non-binary evaluation for schema matching. In: Conceptual Modelling—ER 2012, Oct. (2012)

  43. Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. (CSUR) 22(3), 183–236 (1990)

  44. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. J. Data Semant. IV, 146–171 (2005)

  45. Smith, K., Morse, M., Mork, P., Li, M., Rosenthal, A., Allen, D., Seligman, L., Wolf, C.: The role of schema matching in large enterprises. In: Proc. CIDR (2009)

  46. Steel, R.G.D., Torrie, J.H.: Principles and Procedures of Statistics. McGraw-Hill, New York (1960)

  47. Tu, K., Yu, Y.: CMC: Combining multiple schema-matching strategies based on credibility prediction. In: Zhou, L., Ooi, B., Meng, X. (eds.) Database Systems for Advanced Applications, vol. 3453. LNCS, pp. 995–995. Springer, Berlin (2005)

  48. Wang, J., Wen, J., Lochovsky, F., Ma, W.: Instance-based schema matching for web databases by domain-specific query probing. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol. 30, pp. 408–419. VLDB Endowment (2004)

  49. Yang, X., Lee, M., Ling, T.: Resolving structural conflicts in the integration of xml schemas: A semantic approach. In: Song, I.-Y., Liddle, S., Ling, T.-W., Scheuermann, P. (eds.) Conceptual Modeling—ER 2003, vol. 2813. Lecture Notes in Computer Science, pp. 520–533. Springer, Berlin (2003)

  50. Zobel, J., Moffat, A.: Exploring the similarity space. SIGIR Forum 32, 18–34 (1998)

Acknowledgments

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement number 256955.

Author information

Corresponding author

Correspondence to Avigdor Gal.

About this article

Cite this article

Sagi, T., Gal, A. Schema matching prediction with applications to data source discovery and dynamic ensembling. The VLDB Journal 22, 689–710 (2013). https://doi.org/10.1007/s00778-013-0325-y

  • DOI: https://doi.org/10.1007/s00778-013-0325-y
