Abstract
We evaluate the possibility of learning, in an unsupervised manner, a list of idiomatic word combinations of the type preposition + noun phrase + preposition (P NP P), that is, groups of three or more simple forms that behave as a single lexical unit and whose semantic and syntactic properties cannot be deduced from the corresponding properties of each simple form, e.g., by means of, in order to, in front of. We show that idiomatic P NP P combinations have some statistical properties distinct from those of usual idiomatic collocations. In particular, we found that the most frequent P NP P trigrams tend to be idiomatic. Among the other statistical measures, log-likelihood performs almost as well as frequency for detecting idiomatic expressions of this type, while chi-square and pointwise mutual information perform very poorly. We experiment on Spanish material.
Work partially supported by the Mexican Government (CONACyT, SNI, CGPI-IPN, PIFI-IPN).
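The abstract compares raw frequency with association measures as ways of ranking P NP P trigrams. The following is a minimal sketch of that kind of ranking, not the authors' implementation: it assumes a pre-tokenized corpus, a small hand-written preposition set, a noun phrase reduced to a single token, and pointwise mutual information computed against a full-independence model. The function name `trigram_scores` and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def trigram_scores(tokens, prepositions):
    """Return {trigram: (frequency, pmi)} for P X P trigrams.

    Assumption: the noun phrase is approximated by one non-preposition
    token, and PMI compares the observed trigram probability with the
    product of the three unigram probabilities.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    # Keep only trigrams that start and end with a preposition.
    trigrams = Counter(
        (a, b, c)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:])
        if a in prepositions and c in prepositions and b not in prepositions
    )
    total_trigrams = max(n - 2, 1)
    scores = {}
    for tri, freq in trigrams.items():
        p_tri = freq / total_trigrams
        p_indep = math.prod(unigrams[w] / n for w in tri)
        scores[tri] = (freq, math.log2(p_tri / p_indep))
    return scores


# Toy Spanish input (hypothetical, for illustration only).
tokens = (
    "a fin de que salga por medio de la carta a fin de mes "
    "en lugar de ir a fin de año"
).split()
prepositions = {"a", "de", "por", "en"}

scores = trigram_scores(tokens, prepositions)
by_frequency = sorted(scores, key=lambda t: scores[t][0], reverse=True)
by_pmi = sorted(scores, key=lambda t: scores[t][1], reverse=True)
print(by_frequency[0])
```

On this toy input the frequency ranking puts "a fin de" first, while the PMI score of the single-occurrence "por medio de" is higher than that of "a fin de", which illustrates the well-known tendency of PMI to favor rare events. The pattern filter also admits non-idiomatic matches such as "de mes en", so a real experiment would need part-of-speech information and a much larger corpus.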
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Galicia-Haro, S.N., Gelbukh, A. (2005). Unsupervised Learning of P NP P Word Combinations. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_37
DOI: https://doi.org/10.1007/978-3-540-30586-6_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)