Abstract
In this paper, we present a method for database schema matching, the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration and warehousing, and in semantic web applications. We first present two corpus-based methods: one determines the semantic similarity of two target words, and the other performs automatic word segmentation. Our word similarity method uses pointwise mutual information (PMI) to sort lists of important neighbor words of the two target words; the words common to both lists are selected, and their PMI values are aggregated to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. If any non-matching portions of the text remain, it also applies a modified forward–backward matching technique that uses maximum length frequency and entropy rate. Finally, we exploit both the semantic similarity and the word segmentation methods in our proposed name-based element-level schema matching method. This method uses a single property (i.e., element name) for schema matching and nevertheless achieves a measure score comparable to that of methods that use multiple properties (e.g., element name, text description, data instance, context description). Our schema matching method also uses normalized and modified versions of the longest common subsequence string matching algorithm, combined with weight factors to balance their contributions. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
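As a rough illustration of two building blocks named in the abstract, the following Python sketch computes pointwise mutual information from co-occurrence counts and a length-normalized longest common subsequence (LCS) score for two element names. The abstract does not give the paper's formulas, so everything here is an assumption: the function names (`pmi`, `lcs_length`, `normalized_lcs`) are hypothetical, PMI is written in its standard form, and the LCS score is normalized by squaring the LCS length and dividing by the product of the two string lengths, which is one common choice and not necessarily the authors' exact definition.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """Standard PMI, log2(P(x, y) / (P(x) * P(y))), computed from raw counts.

    pair_count: number of windows in which x and y co-occur (assumed input)
    x_count, y_count: individual counts of x and y
    total: total number of windows or tokens in the corpus
    """
    if pair_count == 0 or x_count == 0 or y_count == 0:
        return float("-inf")  # no evidence of association
    return math.log2((pair_count * total) / (x_count * y_count))


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """Assumed normalization: LCS length squared over the product of lengths."""
    if not a or not b:
        return 0.0
    n = lcs_length(a, b)
    return (n * n) / (len(a) * len(b))
```

For example, `normalized_lcs("custname", "customername")` scores a column name against a longer variant of it, and identical names score 1.0. How such string scores are weighted against the corpus-based semantic similarity is not specified in the abstract and is left out of this sketch.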
Cite this article
Islam, A., Inkpen, D. & Kiringa, I. Applications of corpus-based semantic similarity and word segmentation to database schema matching. The VLDB Journal 17, 1293–1320 (2008). https://doi.org/10.1007/s00778-007-0067-9
DOI: https://doi.org/10.1007/s00778-007-0067-9