Abstract
Most recent schema matching systems assemble multiple components, each employing a particular matching technique. The domain user must then tune the system: select the right components to execute and correctly adjust their numerous “knobs” (e.g., thresholds, formula coefficients). Tuning is skill- and time-intensive, but (as we show) without it the matching accuracy is significantly inferior. We describe eTuner, an approach to automatically tune schema matching systems. Given a schema S, we match S against synthetic schemas, for which the ground truth mapping is known, and find a tuning that demonstrably improves the performance of matching S against real schemas. To efficiently search the huge space of tuning configurations, eTuner works sequentially, starting with tuning the lowest level components. To increase the applicability of eTuner, we develop methods to tune a broad range of matching components. While the tuning process is completely automatic, eTuner can also exploit user assistance (whenever available) to further improve the tuning quality. We employed eTuner to tune four recently developed matching systems on several real-world domains. The results show that eTuner produced tuned matching systems that achieve higher accuracy than the same systems tuned with currently possible methods.
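The core idea of tuning against synthetic schemas with known ground truth can be illustrated with a minimal sketch. This is not eTuner itself: the perturbation rules, the toy name matcher, its single threshold knob, the grid search, and the example schema are all assumptions made for illustration only.

```python
import random
from difflib import SequenceMatcher

def perturb(name, rng):
    """Hypothetical perturbation: drop vowels after the first letter, or add a suffix."""
    if rng.random() < 0.5:
        return name[0] + "".join(c for c in name[1:] if c not in "aeiou")
    return name + "_" + rng.choice(["id", "val", "fld"])

def make_synthetic(schema, rng):
    """Derive a synthetic schema from S; the ground-truth mapping is known by construction."""
    truth = {col: perturb(col, rng) for col in schema}
    return list(truth.values()), truth

def match(source, target, threshold):
    """Toy matcher with one knob: keep the most similar target name if it clears the threshold."""
    result = {}
    for s in source:
        best = max(target, key=lambda t: SequenceMatcher(None, s, t).ratio())
        if SequenceMatcher(None, s, best).ratio() >= threshold:
            result[s] = best
    return result

def f1(predicted, truth):
    """F1 score of a predicted mapping against the ground-truth mapping."""
    correct = sum(1 for s, t in predicted.items() if truth.get(s) == t)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)

def tune(schema, thresholds, n_scenarios=20, seed=0):
    """Pick the threshold with the best average F1 over the synthetic scenarios."""
    rng = random.Random(seed)
    scenarios = [make_synthetic(schema, rng) for _ in range(n_scenarios)]

    def avg_f1(th):
        scores = [f1(match(schema, tgt, th), truth) for tgt, truth in scenarios]
        return sum(scores) / len(scores)

    return max(thresholds, key=avg_f1)

# Hypothetical schema S; the column names are illustrative only.
S = ["customer_name", "order_date", "total_price", "shipping_address"]
best_threshold = tune(S, [0.1 * k for k in range(1, 10)])
```

The chosen threshold would then be used when matching S against real schemas. A real system has many interacting knobs, which is why the abstract describes a sequential search that starts with the lowest level components rather than a single grid search like this one.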
Cite this article
Lee, Y., Sayyadian, M., Doan, A. et al. eTuner: tuning schema matching software using synthetic scenarios. The VLDB Journal 16, 97–122 (2007). https://doi.org/10.1007/s00778-006-0024-z
DOI: https://doi.org/10.1007/s00778-006-0024-z