Lexical association measures and collocation extraction

Pecina, Pavel

doi:10.1007/s10579-009-9101-4

Published: 21 October 2009

Language Resources and Evaluation, Volume 44, pages 137–158 (2010)

Abstract

We present an extensive empirical evaluation of collocation extraction methods based on lexical association measures and their combination. The experiments are performed on three sets of collocation candidates extracted from the Prague Dependency Treebank with manual morphosyntactic annotation and from the Czech National Corpus with automatically assigned lemmas and part-of-speech tags. The collocation candidates were manually labeled as collocational or non-collocational. The evaluation is based on measuring the quality of ranking the candidates according to their chance to form collocations. Performance of the methods is compared by precision-recall curves and mean average precision scores. The work is focused on two-word (bigram) collocations only. We experiment with bigrams extracted from sentence dependency structure as well as from surface word order. Further, we study the effect of corpus size on the performance of the individual methods and their combination.
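The ranking-based evaluation described in the abstract can be illustrated with a minimal sketch. It scores bigram candidates with pointwise mutual information (Church and Hanks 1990), which is only one possible association measure, ranks them, and computes average precision in its common information-retrieval form. The counts, labels, and function names below are hypothetical and are not taken from the article's data, and the article's exact scoring definition may differ.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of a bigram from raw counts."""
    return math.log2((f_xy * n) / (f_x * f_y))

def average_precision(ranked_labels):
    """Mean of the precision values at each rank holding a true collocation."""
    hits = 0
    precisions = []
    for rank, is_collocation in enumerate(ranked_labels, start=1):
        if is_collocation:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical toy data: (bigram, f_xy, f_x, f_y, manual label)
N = 10_000  # assumed total number of bigram tokens
candidates = [
    ("black market", 20, 40, 50, True),
    ("of the", 300, 2000, 3000, False),
    ("prime minister", 30, 60, 40, True),
    ("new house", 5, 500, 200, False),
]

# Rank candidates from the strongest to the weakest association score.
ranked = sorted(candidates, key=lambda c: pmi(c[1], c[2], c[3], N), reverse=True)
labels = [c[4] for c in ranked]
ap = average_precision(labels)
print([c[0] for c in ranked], ap)
```

Averaging such scores over several candidate sets or data splits gives a mean average precision; any other association measure can be substituted for `pmi` without changing the evaluation code.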

Notes

An agreement measure for any number of annotators (Fleiss 1971): \(\kappa = \frac{P_o - P_e}{1 - P_e},\) where \(P_o\) is the relative observed agreement among annotators and \(P_e\) is the theoretical probability of chance agreement (each annotator randomly choosing each category). The factor \(1 - P_e\) then corresponds to the level of agreement achievable above chance, and \(P_o - P_e\) is the level of agreement actually achieved above chance. For two annotators, the exact Fleiss’ \(\kappa\) reduces to the well-known Cohen’s \(\kappa\) (Conger 1980).
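The formula in this note can be sketched in code as follows. This is the standard Fleiss (1971) formulation, in which chance agreement is estimated from the pooled category proportions; the "exact" variant discussed by Conger (1980) may compute \(P_e\) differently. The function name and the rating matrix are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a rating matrix.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # P_o: proportion of agreeing annotator pairs, averaged over items.
    p_o = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # P_e: chance agreement from the overall category proportions.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(n_categories)
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: 4 candidates, 3 annotators,
# categories = (collocation, non-collocation).
ratings = [[3, 0], [0, 3], [2, 1], [3, 0]]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))
```

Here \(P_o = 5/6\) and \(P_e = 5/9\), so \(\kappa = (5/6 - 5/9)/(1 - 5/9) = 0.625\).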

References

Bartsch, S. (2004). Structural and functional properties of collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen: Gunter Narr Verlag.
Berry-Rogghe, G. L. (1973). The computation of collocations and their relevance in lexical studies. In The computer and literary studies (pp. 103–112). Edinburgh, New York: University Press.
Choueka, Y. (1988). Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO.
Choueka, Y., Klein, S., & Neuwitz, E. (1983). Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of the Association for Literary and Linguistic Computing, 4(1), 34–38.
Church, K., & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), 22–29.
Conger, A. J. (1980). Integration and generalisation of Kappas for multiple raters. Psychological Bulletin, 88, 322–328.
Daille, B. (1996). Study and implementation of combined techniques for automatic extraction of terminology. In J. L. Klavans & P. Resnik (Eds.), The balancing act (Chap. 3, pp. 49–66). Cambridge, MA: MIT Press.
Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. PhD Thesis, University of Stuttgart.
Evert, S., & Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 188–195).
Fawcett, T. (2003). ROC graphs: Notes and practical considerations for data mining researchers. Technical Report, HPL 2003–4. Palo Alto, CA: HP Laboratories.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378–382.
Hajič, J. (2004). Disambiguation of rich inflection (computational morphology of Czech) (Vol. 1). Prague: Charles University Press.
Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.
Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval, New York, NY.
Inkpen, D., & Hirst, G. (2002). Acquiring collocations for lexical choice between near synonyms. In SIGLEX workshop on unsupervised lexical acquisition, 40th meeting of the ACL, Philadelphia.
Kita, K., Kato, Y., Omoto, T., & Yano, Y. (1994). A comparative study of automatic extraction of collocations from corpora: Mutual information vs. cost criteria. Journal of Natural Language Processing, 1(1), 21–33.
Krenn, B. (2000). The usual suspects: Data-oriented models for identification and representation of lexical collocations. PhD Thesis, Saarland University.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press, Chap. 5. Collocations.
Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the 2004 conference on EMNLP. Barcelona, Spain.
Palmer, H. E. (1938). A grammar of English words. London: Longman.
PDT (2006). Prague dependency treebank 2.0. Institute of Formal and Applied Linguistics.
Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Third international conference on language resources and evaluation. Las Palmas, Spain.
Pecina, P. (2008a). Machine learning approach to multiword expression extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008), Marrakech, Morocco.
Pecina, P. (2008b). Reference data for Czech collocation extraction. In Proceedings of the sixth international conference on language resources and evaluation workshop: Towards a shared task for multiword expressions (MWE 2008), Marrakech, Morocco.
Pecina, P., & Schlesinger, P. (2006). Combining association measures for collocation extraction. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006). Sydney, Australia.
Shimohata, S., Sugio, T., & Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the 35th meeting of ACL/EACL (pp. 476–481). Madrid, Spain.
Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177.
Stevens, M. E., Giuliano, V. E., & Heilprin, L. B. (Eds.). (1965). Proceedings of the symposium on statistical association methods for mechanized documentation (Vol. 269). Washington, DC: National Bureau of Standards Miscellaneous Publication.
Venables, W. N., & Ripley, B. (2002). Modern applied statistics with S (4th ed.). New York: Springer.
Zhai, C. (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and interdisciplinary conferences on modeling and using context.

Acknowledgments

This article is a revised and extended version of our previous work (Pecina and Schlesinger 2006). The reference data sets are described in detail in Pecina (2008a). Experiments on other data sets that confirm the good results of our combination methods are presented in Pecina (2008b). This work was supported by the Ministry of Education of the Czech Republic, project MSM 0021620838.

Author information

Authors and Affiliations

Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
Pavel Pecina

Corresponding author

Correspondence to Pavel Pecina.

Appendix

Table 3 The inventory of lexical association measures for collocation extraction used in our experiments

About this article

Cite this article

Pecina, P. Lexical association measures and collocation extraction. Lang Resources & Evaluation 44, 137–158 (2010). https://doi.org/10.1007/s10579-009-9101-4

Published: 21 October 2009
Issue Date: April 2010
DOI: https://doi.org/10.1007/s10579-009-9101-4
