Applications of corpus-based semantic similarity and word segmentation to database schema matching

  • Regular Paper
The VLDB Journal

Abstract

In this paper, we present a method for database schema matching: the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration and warehousing, and in semantic web applications. We first present two corpus-based methods: one determines the semantic similarity of two target words, and the other performs automatic word segmentation. We then present a name-based, element-level database schema matching method that exploits both. Our word similarity method uses pointwise mutual information (PMI) to sort lists of important neighbor words of the two target words; the words common to both lists are selected, and their PMI values are aggregated to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from “desegmented” text. If any non-matching portions of the text remain, it also applies a modified forward–backward matching technique that uses maximum length frequency and entropy rate. The schema matching method uses a single property (i.e., element name) and nevertheless achieves a measure score that is comparable to that of methods that use multiple properties (e.g., element name, text description, data instance, context description). It also uses normalized and modified versions of the longest common subsequence string matching algorithm, with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
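
To make two of the building blocks named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It shows the textbook definition of PMI computed from raw co-occurrence counts, and a longest common subsequence (LCS) score normalized by string length. The function names, the count-based PMI signature, and the particular normalization (squared LCS length divided by the product of the two lengths) are assumptions made for illustration; the paper's own PMI variant, its modified LCS versions, and its weight factors are not given in the abstract and are not reproduced here.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Textbook pointwise mutual information (base 2) from raw counts.

    pair_count: times x and y co-occur; count_x, count_y: individual
    counts; total: corpus size. All four are hypothetical inputs.
    """
    if pair_count == 0:
        return float("-inf")  # the pair was never observed together
    return math.log2((pair_count * total) / (count_x * count_y))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """LCS length squared over the product of lengths (assumed form).

    The score lies in [0, 1] and equals 1 only for identical strings.
    """
    if not a or not b:
        return 0.0
    return lcs_length(a, b) ** 2 / (len(a) * len(b))


if __name__ == "__main__":
    # Element names such as these are illustrative, not from the paper.
    print(normalized_lcs("custname", "customername"))
    print(pmi(20, 100, 100, 1000))
```

Normalizing by both lengths keeps a short abbreviation from scoring as highly as an exact match, which is one plausible reason to normalize before combining a string score with a semantic one.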

Corresponding author

Correspondence to Aminul Islam.

About this article

Cite this article

Islam, A., Inkpen, D. & Kiringa, I. Applications of corpus-based semantic similarity and word segmentation to database schema matching. The VLDB Journal 17, 1293–1320 (2008). https://doi.org/10.1007/s00778-007-0067-9

  • DOI: https://doi.org/10.1007/s00778-007-0067-9
