Abstract
Because of idiosyncrasies in their syntax, semantics or frequency, multiword expressions (MWEs) have received special attention from the NLP community, since the methods and techniques developed for simplex words are not necessarily suitable for them. This is certainly the case for the automatic acquisition of MWEs from corpora. Much effort has been directed at identifying them automatically, with considerable success. In this paper, we propose an approach that identifies MWEs in a multilingual context as a by-product of a word alignment process. The approach not only identifies possible MWE candidates but also associates some multiword expressions with semantics. The results indicate that the approach is feasible and demands few tools and resources, and that it could, for example, facilitate and speed up lexicographic work.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10579-009-9097-9/MediaObjects/10579_2009_9097_Fig1_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10579-009-9097-9/MediaObjects/10579_2009_9097_Fig2_HTML.gif)
Notes
Pesquisa FAPESP is available at http://www.revistapesquisa.fapesp.br.
Apertium is an open-source machine translation engine and toolbox available at: http://www.apertium.org.
For example, “artesian wells”, “black hole” and “botanical gardens” are found in CIDE; “clean up”, “consist of” and “depend on” are found in CIDPV.
Evert and Krenn (2005) give a detailed description of standard measures and their application to MWE identification; more material may also be found at http://www.collocations.de.
References
Armentano-Oller, C., Carrasco, R. C., Corbí-Bellot, A. M., Forcada, M. L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Ramírez-Sánchez, G., Sánchez-Martínez, F., & Scalco, M. A. (2006). Open-source Portuguese–Spanish machine translation. In Proceedings of the VII Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR-2006), Itatiaia-RJ, Brazil (pp. 50–59).
Baldwin, T., & Villavicencio, A. (2002). Extracting the unextractable: A case study on verb–particles. In Proceedings of the 6th conference on natural language learning (CoNLL-2002), Taipei, Taiwan.
Briscoe, T., & Carroll, J. (2002). Robust accurate statistical annotation of general text. In Proceedings of LREC-2002.
Brown, P., Della-Pietra, V., Della-Pietra, S., & Mercer, R. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2), 263–312.
Burnard, L. (2000). User Reference Guide for the British National Corpus. Technical report. Oxford, UK: Oxford University Computing Services.
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics 22(2), 249–254.
Caseli, H. M., Nunes, M. G. V., & Forcada, M. L. (2006). Automatic induction of bilingual resources from aligned parallel corpora: Application to shallow-transfer machine translation. Machine Translation 20, 227–245.
Caseli, H. M., Silva, A. M. P., & Nunes, M. G. V. (2004). Evaluation of methods for sentence and lexical alignment of Brazilian Portuguese and English parallel texts. In Proceedings of the SBIA 2004 (LNAI), Berlin, Heidelberg (pp. 184–193).
Evert, S., & Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language 19(4), 450–466.
Fazly, A., & Stevenson, S. (2007). Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the workshop on a broader perspective on multiword expressions, Prague (pp. 9–16).
Hofland, K. (1996). A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, & G. Perissinotto (Eds.), Research in humanities computing (pp. 165–178). Oxford: Oxford University Press.
Jackendoff, R. (1997). Twistin’ the night away. Language 73, 534–559.
Melamed, I. D. (1997). Automatic discovery of non-compositional compounds in parallel data. arXiv:cmp-lg/9706027.
Och, F. J., & Ney, H. (2000a). A comparison of alignment models for statistical machine translation. In Proceedings of the 18th international conference on computational linguistics (COLING-2000), Saarbrücken, Germany (pp. 1086–1090).
Och, F. J., & Ney, H. (2000b). Improved statistical alignment models. In Proceedings of the 38th annual meeting of the ACL, Hong Kong, China (pp. 440–447).
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51.
Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the third international conference on language resources and evaluation, Las Palmas, Canary Islands, Spain (pp. 1–7).
Piao, S. S. L., Sun, G., Rayson, P., & Yuan, Q. (2006). Automatic extraction of Chinese multiword expressions with a statistical tool. In Proceedings of the workshop on multi-word-expressions in a multilingual context (EACL-2006), Trento, Italy (pp. 17–24).
Procter, P. (1995). Cambridge international dictionary of English. Cambridge: Cambridge University Press.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the third international conference on computational linguistics and intelligent text processing (CICLing-2002), Lecture Notes in Computer Science, London, UK, Vol. 2276 (pp. 1–15).
Van de Cruys, T., & Villada Moirón, B. (2007). Semantics-based multiword expression extraction. In Proceedings of the workshop on a broader perspective on multiword expressions, Prague (pp. 25–32).
Villada Moirón, B., & Tiedemann, J. (2006). Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the workshop on multi-word-expressions in a multilingual context (EACL-2006), Trento, Italy (pp. 33–40).
Villavicencio, A. (2005). The availability of verb–particle constructions in lexical resources: How much is enough? Journal of Computer Speech and Language Processing 19, 415–432.
Villavicencio, A., Kordoni, V., Zhang, Y., Idiart, M., & Ramisch, C. (2007). Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague (pp. 1034–1043).
Vogel, S., Ney, H., & Tillmann, C. (1996). HMM-based word alignment in statistical translation. In Proceedings of the 16th international conference on computational linguistics (COLING-1996), Copenhagen (pp. 836–841).
Zhang, Y., Kordoni, V., Villavicencio, A., & Idiart, M. (2006). Automated multiword expression prediction for grammar engineering. In Proceedings of the workshop on multiword expressions: Identifying and exploiting underlying properties, Sydney, Australia (pp. 36–44).
Acknowledgements
We acknowledge the financial support of the Brazilian agencies FAPESP (02/13207-8), CNPq (550388/2005-2), SEBRAE/FINEP (1194/07) and CAPES (CAPES/COFECUB 548/07). We also thank Mônica Saddy Martins for her help in the evaluation process, and the anonymous reviewers for their useful comments.
Cite this article
de Caseli, H.M., Ramisch, C., das Graças Volpe Nunes, M. et al. Alignment-based extraction of multiword expressions. Lang Resources & Evaluation 44, 59–77 (2010). https://doi.org/10.1007/s10579-009-9097-9
DOI: https://doi.org/10.1007/s10579-009-9097-9