Automatic Idiom Recognition with Word Embeddings

  • Conference paper
Information Management and Big Data (SIMBig 2015, SIMBig 2016)

Abstract

Expressions such as add fuel to the fire can be interpreted literally or idiomatically depending on the context in which they occur. Many Natural Language Processing applications could improve their performance if idiom recognition were improved. Our approach is based on the idea that idioms and their literal counterparts do not appear in the same contexts. We propose two approaches: (1) Compute the inner product of context word vectors with the vector representing a target expression. Since literal vectors predict local contexts well, their inner products with contexts should be larger than those of idiomatic ones, thereby telling literals apart from idioms; and (2) Compute literal and idiomatic scatter (covariance) matrices from local contexts in word vector space. Since the scatter matrices represent context distributions, we can then measure the difference between the distributions using the Frobenius norm. For comparison, we implement the methods of [8, 16, 24] and apply them to our data. We provide experimental results validating the proposed techniques.
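
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the averaging of inner products, and the nearest-scatter decision rule are assumptions made for illustration, and the vectors are toy values rather than trained word embeddings.

```python
import numpy as np


def context_inner_product(target_vec, context_vecs):
    """Average inner product between a target expression vector and
    the vectors of its context words (averaging is an assumption)."""
    return float(np.mean(context_vecs @ target_vec))


def scatter_matrix(context_vecs):
    """Scatter (covariance) matrix of context vectors around their mean."""
    centered = context_vecs - context_vecs.mean(axis=0)
    return centered.T @ centered / len(context_vecs)


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two matrices."""
    return float(np.linalg.norm(a - b, ord="fro"))


def classify_by_scatter(test_ctx, literal_ctx, idiom_ctx):
    """Hypothetical decision rule: pick the class whose scatter matrix
    is closer (in Frobenius norm) to the scatter of the test context."""
    s_test = scatter_matrix(test_ctx)
    d_literal = frobenius_distance(s_test, scatter_matrix(literal_ctx))
    d_idiom = frobenius_distance(s_test, scatter_matrix(idiom_ctx))
    return "literal" if d_literal < d_idiom else "idiomatic"


# Toy 2-dimensional "embeddings" (hypothetical values).
target = np.array([1.0, 0.0])
literal_ctx = np.array([[1.0, 0.0], [-1.0, 0.0]])
idiom_ctx = np.array([[0.0, 1.0], [0.0, -1.0]])
test_ctx = np.array([[2.0, 0.0], [-2.0, 0.0]])

score = context_inner_product(target, np.array([[2.0, 0.0], [0.0, 3.0]]))
label = classify_by_scatter(test_ctx, literal_ctx, idiom_ctx)
print(score, label)
```

In practice the context vectors would come from a trained embedding model such as those of [17, 18], and any threshold on the inner-product score would have to be chosen on held-out data.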

Notes

  1. These examples are extracted from the Corpus of Contemporary American English (COCA) (http://corpus.byu.edu/coca/).

References

  1. Birke, J., Sarkar, A.: A clustering approach to the nearly unsupervised recognition of nonliteral language. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento, pp. 329–336 (2006)

  2. Burnard, L.: The British National Corpus Users Reference Guide. Oxford University Computing Services, Oxford (2000)

  3. Cacciari, C.: The place of idioms in a literal and metaphorical world. In: Cacciari, C., Tabossi, P. (eds.) Idioms: Processing, Structure, and Interpretation, pp. 27–53. Lawrence Erlbaum Associates, Hillsdale (1993)

  4. Cilibrasi, R., Vitányi, P.M.B.: The Google similarity distance. IEEE Trans. Knowl. Data Eng. 19(3), 370–383 (2007)

  5. Cook, P., Fazly, A., Stevenson, S.: Pulling their weight: exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In: Proceedings of the ACL 2007 Workshop on A Broader Perspective on Multiword Expressions, pp. 41–48 (2007)

  6. Cook, P., Fazly, A., Stevenson, S.: The VNC-tokens dataset. In: Proceedings of the LREC Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, June 2008

  7. Cowie, A.P., Mackin, R., McCaig, I.R.: Oxford Dictionary of Current Idiomatic English, vol. 2. Oxford University Press, Oxford (1983)

  8. Fazly, A., Cook, P., Stevenson, S.: Unsupervised type and token identification of idiomatic expressions. Comput. Linguist. 35(1), 61–103 (2009)

  9. Feldman, A., Peng, J.: Automatic detection of idiomatic clauses. In: Gelbukh, A. (ed.) CICLing 2013. LNCS, vol. 7816, pp. 435–446. Springer, Heidelberg (2013). doi:10.1007/978-3-642-37247-6_35

  10. Fellbaum, C., Geyken, A., Herold, A., Koerner, F., Neumann, G.: Corpus-based studies of German idioms and light verbs. Int. J. Lexicogr. 19(4), 349–360 (2006)

  11. Firth, J.R.: A synopsis of linguistic theory, 1930–1955 (1957)

  12. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, New York (1990)

  13. Glucksberg, S.: Idiom meanings and allusional content. In: Cacciari, C., Tabossi, P. (eds.) Idioms: Processing, Structure, and Interpretation, pp. 3–26. Lawrence Erlbaum Associates, Hillsdale (1993)

  14. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics (COLING 1992), vol. 2, pp. 539–545. Association for Computational Linguistics, Stroudsburg (1992). http://dx.doi.org/10.3115/992133.992154

  15. Katz, G., Giesbrecht, E.: Automatic identification of non-compositional multiword expressions using latent semantic analysis. In: Proceedings of the ACL/COLING-06 Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pp. 12–19 (2006)

  16. Li, L., Sporleder, C.: Using Gaussian mixture models to detect figurative language in context. In: Proceedings of the NAACL/HLT 2010 (2010)

  17. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of Workshop at ICLR (2013)

  18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the NIPS (2013)

  19. Nunberg, G., Sag, I.A., Wasow, T.: Idioms. Language 70(3), 491–538 (1994)

  20. Pado, S., Lapata, M.: Dependency-based construction of semantic space models. Comput. Linguist. 33(2), 161–199 (2007)

  21. Peng, J., Feldman, A., Vylomova, E.: Classifying idiomatic and literal expressions using topic models and intensity of emotions. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2019–2027. Association for Computational Linguistics, Doha, October 2014. http://www.aclweb.org/anthology/D14-1216

  22. Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2002), Mexico City, pp. 1–15 (2002)

  23. Seaton, M., Macaulay, A. (eds.): Collins COBUILD Idioms Dictionary, 2nd edn. HarperCollins Publishers (2002)

  24. Sporleder, C., Li, L.: Unsupervised recognition of literal and non-literal use of idiomatic expressions. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), pp. 754–762. Association for Computational Linguistics, Morristown (2009)

  25. Villavicencio, A., Copestake, A., Waldron, B., Lambeau, F.: Lexical encoding of MWEs. In: Proceedings of the Second ACL Workshop on Multiword Expressions: Integrating Processing, Barcelona, pp. 80–87 (2004)

  26. Widdows, D., Dorow, B.: Automatic extraction of idioms using graph analysis and asymmetric lexicosyntactic patterns. In: Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition (DeepLA 2005), pp. 48–56. Association for Computational Linguistics, Stroudsburg (2005). http://dl.acm.org/citation.cfm?id=1631850.1631856

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1319846.

Author information

Corresponding author

Correspondence to Jing Peng.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Peng, J., Feldman, A. (2017). Automatic Idiom Recognition with Word Embeddings. In: Lossio-Ventura, J., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2015, SIMBig 2016. Communications in Computer and Information Science, vol 656. Springer, Cham. https://doi.org/10.1007/978-3-319-55209-5_2

  • DOI: https://doi.org/10.1007/978-3-319-55209-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-55208-8

  • Online ISBN: 978-3-319-55209-5

  • eBook Packages: Computer Science, Computer Science (R0)
