Applications of corpus-based semantic similarity and word segmentation to database schema matching

Islam, Aminul; Inkpen, Diana; Kiringa, Iluju

doi:10.1007/s00778-007-0067-9

Regular Paper
Published: 18 October 2007

Volume 17, pages 1293–1320 (2008)

The VLDB Journal

Abstract

In this paper, we present a method for database schema matching: the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration/warehousing, and in semantic web applications. We first present two corpus-based methods: one method is for determining the semantic similarity of two target words and the other is for automatic word segmentation. Then we present a name-based element-level database schema matching method that exploits both the semantic similarity and the word segmentation methods. Our word similarity method uses pointwise mutual information (PMI) to sort lists of important neighbor words of two target words; the words which are common in both lists are selected and their PMI values are aggregated to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. It also uses a modified forward–backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. Finally, we exploit both the semantic similarity and the word segmentation methods in our proposed name-based element-level schema matching method. This method uses a single property (i.e., element name) for schema matching and nevertheless achieves a measure score that is comparable to the methods that use multiple properties (e.g., element name, text description, data instance, context description). Our schema matching method also uses normalized and modified versions of the longest common subsequence string matching algorithm with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
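The two string-level and corpus-level building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pmi`, `lcs_length`, `normalized_lcs`) are hypothetical, the PMI is computed from raw co-occurrence counts with a base-2 logarithm, and the normalization used for the longest common subsequence (squared LCS length divided by the product of the two name lengths) is an assumption chosen for illustration. The paper's modified variants and weight factors are not reproduced here.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information, log2(P(x, y) / (P(x) * P(y))), from raw counts.

    count_xy: co-occurrences of x and y; count_x, count_y: individual counts;
    total: number of observations used to turn counts into probabilities.
    """
    if min(count_xy, count_x, count_y) <= 0:
        return float("-inf")  # undefined for unseen events; treated as lowest score
    return math.log2(count_xy * total / (count_x * count_y))


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """Assumed normalization: LCS length squared over the product of the lengths."""
    if not a or not b:
        return 0.0
    n = lcs_length(a.lower(), b.lower())
    return n * n / (len(a) * len(b))


if __name__ == "__main__":
    # Two hypothetical element names from different schemas.
    print(normalized_lcs("custname", "customername"))
    # Hypothetical counts: the pair co-occurs exactly as often as chance predicts.
    print(pmi(10, 100, 100, 1000))
```

In a name-based matcher of the kind the abstract describes, a string score like this would be combined with the corpus-based semantic score through weight factors; the combination itself is left out because the abstract does not specify it.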

Author information

Authors and Affiliations

School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
Aminul Islam, Diana Inkpen & Iluju Kiringa

Corresponding author

Correspondence to Aminul Islam.

About this article

Cite this article

Islam, A., Inkpen, D. & Kiringa, I. Applications of corpus-based semantic similarity and word segmentation to database schema matching. The VLDB Journal 17, 1293–1320 (2008). https://doi.org/10.1007/s00778-007-0067-9

Received: 24 July 2006
Revised: 08 May 2007
Accepted: 08 July 2007
Published: 18 October 2007
Issue Date: August 2008
DOI: https://doi.org/10.1007/s00778-007-0067-9