Abstract
We present an extensive empirical evaluation of collocation extraction methods based on lexical association measures and their combination. The experiments are performed on three sets of collocation candidates extracted from the Prague Dependency Treebank with manual morphosyntactic annotation and from the Czech National Corpus with automatically assigned lemmas and part-of-speech tags. The collocation candidates were manually labeled as collocational or non-collocational. The evaluation is based on measuring the quality of ranking the candidates according to their chance to form collocations. Performance of the methods is compared by precision-recall curves and mean average precision scores. The work is focused on two-word (bigram) collocations only. We experiment with bigrams extracted from sentence dependency structure as well as from surface word order. Further, we study the effect of corpus size on the performance of the individual methods and their combination.
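The ranking-based evaluation described above can be illustrated with a minimal sketch. It assumes the textbook definitions of precision, recall, and average precision over a ranked candidate list; the exact averaging scheme used in the experiments is not given in this section, and the function names and toy scores below are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Order candidates from the highest association score to the lowest."""
    return sorted(scores, key=scores.get, reverse=True)


def precision_recall_points(ranked_labels: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each position of the ranked list."""
    total_true = sum(ranked_labels)
    points = []
    hits = 0
    for rank, is_collocation in enumerate(ranked_labels, start=1):
        hits += is_collocation
        recall = hits / total_true if total_true else 0.0
        points.append((hits / rank, recall))
    return points


def average_precision(ranked_labels: Sequence[bool]) -> float:
    """Mean of the precision values taken at the rank of each true collocation."""
    hits = 0
    precision_sum = 0.0
    for rank, is_collocation in enumerate(ranked_labels, start=1):
        if is_collocation:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical candidates: association scores and manual labels.
scores = {"bigram_a": 7.2, "bigram_b": 3.1, "bigram_c": 5.4}
gold = {"bigram_a": True, "bigram_b": True, "bigram_c": False}

ranked = rank_by_score(scores)
labels = [gold[candidate] for candidate in ranked]
ap = average_precision(labels)
```

A mean average precision score would then be the mean of such values over several evaluation runs, for example over data folds; that choice is an assumption of this sketch.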
Notes
An agreement measure for any number of annotators (Fleiss 1971): \(\kappa = \frac{P_o - P_e}{1 - P_e},\) where \(P_o\) is the relative observed agreement among annotators and \(P_e\) is the theoretical probability of chance agreement (each annotator randomly choosing each category). The factor \(1 - P_e\) then corresponds to the level of agreement achievable above chance and \(P_o - P_e\) is the level of agreement actually achieved above chance. For two annotators, the exact Fleiss’ \(\kappa\) reduces to the well-known Cohen’s \(\kappa\) (Conger 1980).
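As a worked illustration of the formula, the following sketch computes \(\kappa\) from a table of label counts. It assumes the standard pooled-marginal form of Fleiss' \(\kappa\), in which \(P_e\) is estimated from the overall category proportions; the "exact" variant discussed by Conger (1980) uses per-annotator label distributions and is not implemented here. The function name and input layout are hypothetical.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of annotators who put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of annotations")

    # P_o: observed agreement, averaged over items.
    p_o = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # P_e: chance agreement from the pooled category proportions.
    total = n_items * n_raters
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all labels fall into one category")

    return (p_o - p_e) / (1 - p_e)


# Three annotators, two categories, perfect agreement on both items: kappa = 1.0
perfect = fleiss_kappa([[3, 0], [0, 3]])
```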
References
Bartsch, S. (2004). Structural and functional properties of collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen: Gunter Narr Verlag.
Berry-Rogghe, G. L. (1973). The computation of collocations and their relevance in lexical studies. In The computer and literary studies (pp. 103–112). Edinburgh, New York: University Press.
Choueka, Y. (1988). Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO.
Choueka, Y., Klein, S., & Neuwitz, E. (1983). Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing, 4(1), 34–38.
Church, K., & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), 22–29.
Conger, A. J. (1980). Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322–328.
Daille, B. (1996). Study and implementation of combined techniques for automatic extraction of terminology. In J. L. Klavans & P. Resnik (Eds.), The balancing act (Chap. 3, pp. 49–66). Cambridge, MA: MIT Press.
Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. PhD Thesis, University of Stuttgart.
Evert, S., & Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 188–195).
Fawcett, T. (2003). ROC graphs: Notes and practical considerations for data mining researchers. Technical Report, HPL 2003–4. Palo Alto CA: HP Laboratories.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech) (Vol. 1). Prague: Charles University Press.
Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.
Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval, New York, NY.
Inkpen, D., & Hirst, G. (2002). Acquiring collocations for lexical choice between near synonyms. In SIGLEX workshop on unsupervised lexical acquisition, 40th meeting of the ACL, Philadelphia.
Kita, K., Kato, Y., Omoto, T., & Yano, Y. (1994). A comparative study of automatic extraction of collocations from corpora: Mutual information vs. cost criteria. Journal of Natural Language Processing, 1(1), 21–33.
Krenn, B. (2000). The usual suspects: Data-oriented models for identification and representation of lexical collocations. PhD Thesis, Saarland University.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press, Chap. 5. Collocations.
Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the 2004 conference on EMNLP. Barcelona, Spain.
Palmer, H. E. (1938). A grammar of English words. London: Longman.
PDT (2006). Prague dependency treebank 2.0. Institute of Formal and Applied Linguistics.
Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Third international conference on language resources and evaluation. Las Palmas, Spain.
Pecina, P. (2008a). Machine learning approach to multiword expression extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008). Marrakech, Morocco.
Pecina, P. (2008b). Reference data for Czech collocation extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008). Marrakech, Morocco.
Pecina, P., & Schlesinger, P. (2006). Combining association measures for collocation extraction. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006). Sydney, Australia.
Shimohata, S., Sugio, T., & Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the 35th meeting of ACL/EACL (pp. 476–481). Madrid, Spain.
Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177.
Stevens, M. E., Giuliano, V. E., & Heilprin, L. B. (Eds.). (1965). Proceedings of the symposium on statistical association methods for mechanized documentation (Vol. 269). Washington, DC: National Bureau of Standards Miscellaneous Publication.
Venables, W. N., & Ripley, B. (2002). Modern applied statistics with S (4th ed.). New York: Springer.
Zhai, C. (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and interdisciplinary conferences on modeling and using context.
Acknowledgments
This is a revised and extended version of our previous work (Pecina and Schlesinger 2006). The reference data sets are described in detail in Pecina (2008b). Experiments performed on other data sets, which confirm the good results of our combination methods, are presented in Pecina (2008a). This work was supported by the Ministry of Education of the Czech Republic, project MSM 0021620838.
Cite this article
Pecina, P. Lexical association measures and collocation extraction. Lang Resources & Evaluation 44, 137–158 (2010). https://doi.org/10.1007/s10579-009-9101-4
DOI: https://doi.org/10.1007/s10579-009-9101-4