eTuner: tuning schema matching software using synthetic scenarios

Lee, Yoonkyong; Sayyadian, Mayssam; Doan, AnHai; Rosenthal, Arnon S.

doi:10.1007/s00778-006-0024-z

Special Issue Paper
Published: 14 September 2006

Volume 16, pages 97–122, (2007)

The VLDB Journal

Yoonkyong Lee¹,
Mayssam Sayyadian¹,
AnHai Doan¹ &
Arnon S. Rosenthal²

Abstract

Most recent schema matching systems assemble multiple components, each employing a particular matching technique. The domain user must then tune the system: select the right components to execute and correctly adjust their numerous “knobs” (e.g., thresholds, formula coefficients). Tuning is skill- and time-intensive, but (as we show) without it the matching accuracy is significantly inferior. We describe eTuner, an approach to automatically tune schema matching systems. Given a schema S, we match S against synthetic schemas, for which the ground truth mapping is known, and find a tuning that demonstrably improves the performance of matching S against real schemas. To efficiently search the huge space of tuning configurations, eTuner works sequentially, starting with tuning the lowest-level components. To increase the applicability of eTuner, we develop methods to tune a broad range of matching components. While the tuning process is completely automatic, eTuner can also exploit user assistance (whenever available) to further improve the tuning quality. We employed eTuner to tune four recently developed matching systems on several real-world domains. The results show that eTuner produced tuned matching systems that achieve higher accuracy than the same systems tuned with currently possible methods.
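
The core idea in the abstract, tuning a knob against synthetic schemas whose correct mapping is known by construction, can be illustrated with a small Python sketch. This is not the authors' implementation: the perturbation rule, the string-similarity matcher, the single threshold knob, and every function name below are hypothetical stand-ins chosen only to keep the example self-contained. It also tunes one knob of one component, so it does not show the sequential, lowest-level-first search over many components.

```python
import difflib
import random


def perturb(name, rng):
    """Hypothetical perturbation: drop one character or truncate the name."""
    if len(name) > 4 and rng.random() < 0.5:
        i = rng.randrange(1, len(name))
        return name[:i] + name[i + 1:]
    return name[: max(3, len(name) // 2)]


def make_synthetic(schema, rng):
    """Return a perturbed copy of `schema` and the known ground-truth mapping."""
    synthetic, truth = [], set()
    for attr in schema:
        new_attr = perturb(attr, rng)
        synthetic.append(new_attr)
        truth.add((attr, new_attr))
    return synthetic, truth


def match(schema, other, threshold):
    """Toy matcher (an assumption): pair names whose similarity reaches the knob."""
    predicted = set()
    for a in schema:
        for b in other:
            if difflib.SequenceMatcher(None, a, b).ratio() >= threshold:
                predicted.add((a, b))
    return predicted


def f1(predicted, truth):
    """F1 score of predicted pairs against the ground-truth pairs."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(schema, candidates, n_scenarios=20, seed=0):
    """Pick the candidate threshold with the best average F1 on synthetic scenarios."""
    rng = random.Random(seed)
    scenarios = [make_synthetic(schema, rng) for _ in range(n_scenarios)]

    def average_f1(threshold):
        scores = [f1(match(schema, syn, threshold), truth) for syn, truth in scenarios]
        return sum(scores) / len(scores)

    return max(candidates, key=average_f1)


if __name__ == "__main__":
    s = ["customer_name", "customer_address", "order_date", "order_total"]
    print(tune_threshold(s, [0.3, 0.5, 0.7, 0.9]))
```

Because the synthetic schemas are derived from S itself, no manually labeled matches are needed to score a candidate setting; the chosen threshold would then be used when matching S against real schemas.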

Author information

Authors and Affiliations

University of Illinois, Urbana, IL, 61801, USA
Yoonkyong Lee, Mayssam Sayyadian & AnHai Doan
The MITRE Corporation, Bedford, MA, 01730, USA
Arnon S. Rosenthal

Corresponding author

Correspondence to Yoonkyong Lee.

About this article

Cite this article

Lee, Y., Sayyadian, M., Doan, A. et al. eTuner: tuning schema matching software using synthetic scenarios. The VLDB Journal 16, 97–122 (2007). https://doi.org/10.1007/s00778-006-0024-z

Received: 15 January 2006
Accepted: 11 June 2006
Published: 14 September 2006
Issue Date: January 2007
DOI: https://doi.org/10.1007/s00778-006-0024-z
