Abstract
Fixed multiword expressions are strings of words which together behave like a single word. This research establishes a method for the automatic extraction of such expressions. Our method involves three stages. In the first, a statistical measure is used to extract candidate bigrams. In the second, we use this list to select occurrences of candidate expressions in a corpus, together with their surrounding contexts. These examples are used as training data for supervised machine learning, resulting in a classifier which can identify target multiword expressions. The final stage is the estimation of the part of speech of each extracted expression based on its context of occurence. Evaluation demonstrated that collocation measures alone are not effective in identifying target expressions. However, when trained on one million examples, the classifier identified target multiword expressions with precision greater than 90%. Part of speech estimation had precision and recall of over 95%.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Church, K., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16, 22–29 (1990)
Smadja, F.: Retrieving collocations from text: Xtract. Computational Linguistics 19, 143–177 (1993); Special Issue on Using Large Corpora: I
Thanopoulos, A., Fakotakis, N., Kokkinakis, G.: Comparative evaluation of collocation extraction metrics. In: International Conference on Language Resources and Evaluation (LREC-2002), pp. 620–625 (2002)
Sag, I., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: A pain in the neck for NLP. In: Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002), Mexico City, Mexico, CICLING, pp. 1–15 (2002)
Baldwin, T., Villavicencio, A.: Extracting the unextractable: A case study on verb-particles. In: Roth, D., van den Bosch, A. (eds.) Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, pp. 98–104 (2002)
Villavicencio, A.: Verb-particle constructions and lexical resources. In: Bond, F., Korhonen, A., McCarthy, D., Villavicencio, A. (eds.) Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, ACL, pp. 57–64 (2003)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (1999)
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19, 61–74 (1993)
Collins, M.: Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania (1999)
Yamada, H., Matsumoto, Y.: Statistical dependency analysis with support vector machines. In: IWPT 2003: 8th International Workshop on Parsing Technologies, pp. 195–206 (2003)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hore, C., Asahara, M., Matsumoto, Y. (2005). Automatic Extraction of Fixed Multiword Expressions. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science(), vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_50
Download citation
DOI: https://doi.org/10.1007/11562214_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer ScienceComputer Science (R0)