Alignment-based extraction of multiword expressions

de Caseli, Helena Medeiros; Ramisch, Carlos; das Graças Volpe Nunes, Maria; Villavicencio, Aline

doi:10.1007/s10579-009-9097-9

Published: 14 August 2009

Language Resources and Evaluation, Volume 44, pages 59–77 (2010)

Helena Medeiros de Caseli¹, Carlos Ramisch², Maria das Graças Volpe Nunes³ & Aline Villavicencio²,⁴

Abstract

Due to idiosyncrasies in their syntax, semantics or frequency, Multiword Expressions (MWEs) have received special attention from the NLP community, as the methods and techniques developed for the treatment of simplex words are not necessarily suitable for them. This is certainly the case for the automatic acquisition of MWEs from corpora. A lot of effort has been directed to the task of automatically identifying them, with considerable success. In this paper, we propose an approach for the identification of MWEs in a multilingual context, as a by-product of a word alignment process, that not only deals with the identification of possible MWE candidates, but also associates some multiword expressions with semantics. The results obtained indicate the feasibility and low costs in terms of tools and resources demanded by this approach, which could, for example, facilitate and speed up lexicographic work.
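
The idea of obtaining MWE candidates as a by-product of word alignment can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mwe_candidates`, the index-pair alignment format, the toy English–Portuguese sentence pair, and the rule "two or more contiguous source words aligned to one target word" are all assumptions made for illustration only.

```python
from collections import defaultdict


def mwe_candidates(source_tokens, target_tokens, alignment):
    """Collect n:1 alignments as MWE candidates (hypothetical helper).

    alignment is a list of (source_index, target_index) pairs, assumed to
    come from some word aligner. A candidate is reported when two or more
    contiguous source tokens are aligned to the same target token.
    """
    # Group source positions by the target position they align to.
    by_target = defaultdict(list)
    for src_idx, tgt_idx in alignment:
        by_target[tgt_idx].append(src_idx)

    candidates = []
    for tgt_idx, src_indices in sorted(by_target.items()):
        src_indices = sorted(set(src_indices))
        # Keep only contiguous spans, e.g. positions [1, 2] but not [1, 3].
        contiguous = src_indices == list(range(src_indices[0], src_indices[-1] + 1))
        if len(src_indices) >= 2 and contiguous:
            phrase = " ".join(source_tokens[i] for i in src_indices)
            # The aligned target word serves as a rough gloss for the candidate.
            candidates.append((phrase, target_tokens[tgt_idx]))
    return candidates


# Toy sentence pair with a hand-written alignment (illustrative data only).
source = ["they", "clean", "up", "the", "room"]
target = ["eles", "limpam", "o", "quarto"]
alignment = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]

print(mwe_candidates(source, target, alignment))
```

In this sketch the aligned target word is kept next to each candidate, which is one simple way to read the abstract's remark about associating expressions with semantics; filtering of the candidates is left out entirely.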

Notes

Pesquisa FAPESP is available at http://www.revistapesquisa.fapesp.br.
Apertium is an open-source machine translation engine and toolbox available at: http://www.apertium.org.
http://www-igm.univ-mlv.fr/~unitex/.
For example, “artesian wells”, “black hole” and “botanical gardens” are found in CIDE; “clean up”, “consist of” and “depend on” are found in CIDPV.
Evert and Krenn (2005) give a detailed description of standard measures and their application to MWE identification; more material can be found at http://www.collocations.de.

References

Armentano-Oller, C., Carrasco, R. C., Corbí-Bellot, A. M., Forcada, M. L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Ramírez-Sánchez, G., Sánchez-Martínez, F., & Scalco, M. A. (2006). Open-source Portuguese–Spanish machine translation. In Proceedings of the VII Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR-2006), Itatiaia-RJ, Brazil (pp. 50–59).
Baldwin, T., & Villavicencio, A. (2002). Extracting the unextractable: A case study on verb–particles. In Proceedings of the 6th conference on natural language learning (CoNLL-2002), Taipei, Taiwan.
Briscoe, T., & Carroll, J. (2002). Robust accurate statistical annotation of general text. In Proceedings of LREC-2002.
Brown, P., Della-Pietra, V., Della-Pietra, S., & Mercer, R. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2), 263–312.
Burnard, L. (2000). User Reference Guide for the British National Corpus. Technical report. Oxford, UK: Oxford University Computing Services.
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics 22(2), 249–254.
Caseli, H. M., Nunes, M. G. V., & Forcada, M. L. (2006). Automatic induction of bilingual resources from aligned parallel corpora: Application to shallow-transfer machine translation. Machine Translation 20, 227–245.
Caseli, H. M., Silva, A. M. P., & Nunes, M. G. V. (2004). Evaluation of methods for sentence and lexical alignment of Brazilian Portuguese and English parallel texts. In Proceedings of the SBIA 2004 (LNAI), Berlin, Heidelberg (pp. 184–193).
Evert, S., & Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language 19(4), 450–466.
Fazly, A., & Stevenson, S. (2007). Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the workshop on a broader perspective on multiword expressions, Prague (pp. 9–16).
Hofland, K. (1996). A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, & G. Perissinotto (Eds.), Research in humanities computing (pp. 165–178). Oxford: Oxford University Press.
Jackendoff, R. (1997). Twistin’ the night away. Language 73, 534–559.
Melamed, I. D. (1997). Automatic discovery of non-compositional compounds in parallel data. arXiv:cmp-lg/9706027.
Och, F. J., & Ney, H. (2000a). A comparison of alignment models for statistical machine translation. In Proceedings of the 18th international conference on computational linguistics (COLING-2000), Saarbrücken, Germany (pp. 1086–1090).
Och, F. J., & Ney, H. (2000b). Improved statistical alignment models. In Proceedings of the 38th annual meeting of the ACL, Hong Kong, China (pp. 440–447).
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51.
Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the third international conference on language resources and evaluation, Las Palmas, Canary Islands, Spain (pp. 1–7).
Piao, S. S. L., Sun, G., Rayson, P., & Yuan, Q. (2006). Automatic extraction of Chinese multiword expressions with a statistical tool. In Proceedings of the workshop on multi-word-expressions in a multilingual context (EACL-2006), Trento, Italy (pp. 17–24).
Procter, P. (1995). Cambridge international dictionary of English. Cambridge: Cambridge University Press.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the third international conference on computational linguistics and intelligent text processing (CICLing-2002), Lecture Notes in Computer Science, London, UK, Vol. 2276 (pp. 1–15).
Van de Cruys, T., & Villada Moirón, B. (2007). Semantics-based multiword expression extraction. In Proceedings of the workshop on a broader perspective on multiword expressions, Prague (pp. 25–32).
Villada Moirón, B., & Tiedemann, J. (2006). Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the workshop on multi-word-expressions in a multilingual context (EACL-2006), Trento, Italy (pp. 33–40).
Villavicencio, A. (2005). The availability of verb–particle constructions in lexical resources: How much is enough? Journal of Computer Speech and Language Processing 19, 415–432.
Villavicencio, A., Kordoni, V., Zhang, Y., Idiart, M., & Ramisch, C. (2007). Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague (pp. 1034–1043).
Vogel, S., Ney, H., & Tillmann, C. (1996). HMM-based word alignment in statistical translation. In Proceedings of the 16th international conference on computational linguistics (COLING-1996), Copenhagen (pp. 836–841).
Zhang, Y., Kordoni, V., Villavicencio, A., & Idiart, M. (2006). Automated multiword expression prediction for grammar engineering. In Proceedings of the workshop on multiword expressions: Identifying and exploiting underlying properties, Sydney, Australia (pp. 36–44).

Acknowledgements

We gratefully acknowledge the financial support of the Brazilian agencies FAPESP (02/13207-8), CNPq (550388/2005-2), SEBRAE/FINEP (1194/07) and CAPES (CAPES/COFECUB 548/07). We also thank Mônica Saddy Martins for help with the evaluation process, and the anonymous reviewers for their useful comments.

Author information

Authors and Affiliations

1. NILC, Department of Computer Science, Federal University of São Carlos, São Carlos, Brazil: Helena Medeiros de Caseli
2. Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil: Carlos Ramisch & Aline Villavicencio
3. NILC, ICMC, University of São Paulo, São Carlos, Brazil: Maria das Graças Volpe Nunes
4. Department of Computer Science, University of Bath, Bath, UK: Aline Villavicencio

Corresponding author

Correspondence to Aline Villavicencio.

About this article

Cite this article

de Caseli, H.M., Ramisch, C., das Graças Volpe Nunes, M. et al. Alignment-based extraction of multiword expressions. Lang Resources & Evaluation 44, 59–77 (2010). https://doi.org/10.1007/s10579-009-9097-9

Received: 20 November 2007
Accepted: 14 July 2009
Published: 14 August 2009
Issue Date: April 2010
DOI: https://doi.org/10.1007/s10579-009-9097-9
