A Hybrid Approach for Multiword Expression Identification

Ramisch, Carlos; de Medeiros Caseli, Helena; Villavicencio, Aline; Machado, André; Finatto, Maria José

doi:10.1007/978-3-642-12320-7_9

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6001)

Abstract

Considerable attention has been given to the problem of Multiword Expression (MWE) identification and treatment in NLP tasks such as parsing and generation, with the aim of improving the quality of results. Statistical methods have often been employed for MWE identification, as an inexpensive and language-independent way of finding co-occurrence patterns. On the other hand, more linguistically motivated identification methods, which employ information such as POS filters and lexical alignment between languages, can produce more targeted candidate lists. In this paper we propose a hybrid approach that combines the strengths of different sources of information using a machine learning algorithm to produce more robust and precise results. Automatic evaluation on gold standards shows that the performance of our hybrid method is superior to the individual results of statistical and alignment-based MWE extraction approaches for Portuguese and for English. This method can be used to aid lexicographic work by providing a more targeted MWE candidate list.
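
As a rough illustration of what combining sources of information can look like, the sketch below builds one feature record per two-word candidate, mixing statistical evidence (frequency and pointwise mutual information) with a flag for whether the candidate was also proposed by a lexical aligner. This is not the authors' implementation: the function names, the choice of PMI as the association measure, the restriction to bigrams, and the `aligned_candidates` input are all assumptions made for this example. Records of this kind could then be passed to any supervised classifier trained on a gold standard.

```python
import math
from collections import Counter


def pmi(bigram, unigram_counts, bigram_counts, n_tokens):
    """Pointwise mutual information (base 2) of a two-word candidate.

    Assumption for this sketch: probabilities are plain relative
    frequencies, with no smoothing.
    """
    w1, w2 = bigram
    n_bigrams = n_tokens - 1
    p_xy = bigram_counts[bigram] / n_bigrams
    p_x = unigram_counts[w1] / n_tokens
    p_y = unigram_counts[w2] / n_tokens
    return math.log2(p_xy / (p_x * p_y))


def hybrid_features(tokens, aligned_candidates):
    """Build one feature record per bigram candidate.

    tokens: list of word tokens from a monolingual corpus.
    aligned_candidates: hypothetical set of bigrams that an
        alignment-based extractor proposed as MWE candidates.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    features = {}
    for bigram, freq in bigram_counts.items():
        features[bigram] = {
            "freq": freq,  # statistical evidence
            "pmi": pmi(bigram, unigram_counts, bigram_counts, n_tokens),
            "aligned": bigram in aligned_candidates,  # alignment evidence
        }
    return features


# Toy usage: the corpus and the aligner output are made up.
toy_tokens = ["new", "york", "is", "new", "york"]
toy_aligned = {("new", "york")}
toy_features = hybrid_features(toy_tokens, toy_aligned)
print(toy_features[("new", "york")])
```

Keeping the evidence as separate fields, rather than collapsing it into a single score, leaves the weighting of each source to the learning algorithm.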

Author information

Authors and Affiliations

GETALP/LIG, University of Grenoble, France
Carlos Ramisch
Institute of Informatics, Federal University of Rio Grande do Sul, Brazil
Carlos Ramisch, Aline Villavicencio & André Machado
Department of Computer Science, Federal University of São Carlos, Brazil
Helena de Medeiros Caseli
Department of Computer Sciences, Bath University, UK
Aline Villavicencio
Institute of Language and Linguistics, Federal University of Rio Grande do Sul, Brazil
Maria José Finatto

Editor information

Editors and Affiliations

Núcleo Interinstitucional de Lingüística Computacional (NILC), Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, CP 668 13.560-970, São Carlos-SP, Brasil
Thiago Alexandre Salgueiro Pardo
Faculdade de Ciências de Lisboa, Departamento de Informática, Cidade Universitária, 1749-016, Lisboa, Portugal
António Branco
Signal Processing Laboratory, Universidade Federal do Pará, Rua Augusto Correa. 1, 660750110, Belém, PA, Brazil
Aldebaro Klautau
Pontifícia Universidade do Rio Grande do Sul, Porto Alegre, Brasil
Renata Vieira
Programa de Pós-Graduação em Ciência da Computação - PPGCC Avenida Ipiranga, 6681 - Prédio 32 - Partenon, CEP 90619-900, Porto Alegre, RS, Brasil
Vera Lúcia Strube de Lima

About this paper

Cite this paper

Ramisch, C., de Medeiros Caseli, H., Villavicencio, A., Machado, A., Finatto, M.J. (2010). A Hybrid Approach for Multiword Expression Identification. In: Pardo, T.A.S., Branco, A., Klautau, A., Vieira, R., de Lima, V.L.S. (eds) Computational Processing of the Portuguese Language. PROPOR 2010. Lecture Notes in Computer Science, vol 6001. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12320-7_9

DOI: https://doi.org/10.1007/978-3-642-12320-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12319-1
Online ISBN: 978-3-642-12320-7
eBook Packages: Computer Science, Computer Science (R0)
