Abstract
This paper presents an approach to building a novel two-level collocation net from a large raw corpus. The net enables calculation of the collocation relationship between any two words. The first level consists of atomic classes (each atomic class consists of one word and feature bigram), which are clustered into the second-level class set. Each class at both levels is represented by its collocation candidate distribution over the possible collocation relation types, extracted from the linguistic analysis of the raw training corpus. In this way, all the information extracted from the linguistic analysis is kept in the collocation net. The approach applies to both frequently and less frequently occurring words, because the collocation net provides a clustering mechanism to resolve the data sparseness problem. Experiments show that the collocation net is efficient and effective in resolving the data sparseness problem and in determining the collocation relationship between any two words.
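The two-level structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `CollocationNet`, the relation labels, the cluster identifiers, the relative-frequency score, and the `min_count` back-off threshold are all assumptions made for illustration. Atomic classes are modelled as (word, feature) pairs, each holding counts of collocation candidates per relation type; when an atomic class has too few observations, the sketch falls back to the pooled counts of its second-level class.

```python
from collections import Counter, defaultdict


class CollocationNet:
    """Toy two-level net: atomic classes (word, feature) grouped into clusters."""

    def __init__(self, min_count=2):
        self.min_count = min_count          # hypothetical back-off threshold
        self.atomic = defaultdict(Counter)  # atom -> Counter[(relation, candidate atom)]
        self.cluster_of = {}                # atom -> second-level class id

    def observe(self, head, relation, dependent, count=1):
        """Record a collocation candidate seen in the analysed corpus."""
        self.atomic[head][(relation, dependent)] += count

    def assign(self, atom, cluster_id):
        """Place an atomic class into a second-level class."""
        self.cluster_of[atom] = cluster_id

    def _cluster_counts(self, cluster_id):
        """Pool the counts of every atomic class in one second-level class."""
        pooled = Counter()
        for atom, cid in self.cluster_of.items():
            if cid != cluster_id:
                continue
            for (rel, cand), n in self.atomic.get(atom, Counter()).items():
                # Candidates are mapped to their own cluster when they have one.
                pooled[(rel, self.cluster_of.get(cand, cand))] += n
        return pooled

    def score(self, head, relation, dependent):
        """Relative frequency at the atomic level, else at the cluster level."""
        counts = self.atomic.get(head, Counter())
        total = sum(counts.values())
        if total >= self.min_count:
            return counts[(relation, dependent)] / total
        cid = self.cluster_of.get(head)
        if cid is None:
            return 0.0
        pooled = self._cluster_counts(cid)
        pooled_total = sum(pooled.values())
        key = (relation, self.cluster_of.get(dependent, dependent))
        return pooled[key] / pooled_total if pooled_total else 0.0


# Illustrative data only; none of these counts come from the paper.
drink, sip = ("drink", "VB"), ("sip", "VB")
tea, coffee = ("tea", "NN"), ("coffee", "NN")

net = CollocationNet(min_count=2)
net.observe(drink, "verb-object", tea, 3)
net.observe(drink, "verb-object", coffee, 1)
net.observe(drink, "verb-adverb", ("slowly", "RB"), 1)
net.observe(sip, "verb-object", tea, 1)
for atom in (drink, sip):
    net.assign(atom, "INGEST")
for atom in (tea, coffee):
    net.assign(atom, "BEVERAGE")

frequent_score = net.score(drink, "verb-object", tea)   # atomic level
sparse_score = net.score(sip, "verb-object", coffee)    # backs off to cluster level
```

In this toy data, "sip" was never observed with "coffee", so its atomic-level count for that pair is zero; the cluster-level fallback still yields a non-zero score because "drink" shares its second-level class. This is one simple way to picture how clustering can ease data sparseness; the paper's actual clustering and scoring procedures may differ.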
Keywords
- Parse Tree
- Linguistic Analysis
- Atomic Class
- Data Sparseness Problem
- Statistical Natural Language Processing
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, G., Zhang, M., Fu, G. (2006). Building a Collocation Net. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_56
DOI: https://doi.org/10.1007/11940098_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science, Computer Science (R0)