Compilation of an idiom example database for supervised idiom identification

Language Resources and Evaluation

Abstract

Some phrases can be interpreted in context either idiomatically (figuratively) or literally. Precise identification of idioms is essential for full-fledged natural language processing. To this end, the authors of this paper have created an idiom corpus for Japanese. This paper reports on the corpus itself and the results of an idiom identification experiment conducted using the corpus. The corpus targets 146 ambiguous idioms and consists of 102,856 examples, each annotated with a literal/idiomatic label. All sentences were collected from the World Wide Web. For idiom identification, 90 of the 146 idioms were targeted, and a word sense disambiguation (WSD) method was adopted using both common WSD features and idiom-specific features. The corpus and the experiment are both, as far as can be determined, the largest of their kinds. A standard supervised WSD method was found to work well for idiom identification, achieving accuracies of 89.25% and 88.86% with and without idiom-specific features, respectively. The most effective idiom-specific feature was found to be the one that involves the adjacency of idiom constituents.

Notes

  1. A preliminary version of this study was presented in Hashimoto and Kawahara (2008). This paper extends that paper in several respects: it compares this study with many more previous studies; adds an extensive characterization of Japanese idioms; describes the updated version of our idiom corpus and a newly developed online browser for the corpus; discusses the full details of the features used in the experiment, which could not be presented in the previous paper because of the page limit; and presents additional experimental results obtained when one of the idiom features is left out.

  2. http://www.cs.sfu.ca/~anoop/students/jbirke/

  3. http://www.multiword.sourceforge.net/PHITE.php?sitesig=FILES&page=FILES_20_Data_Sets

  4. http://www.nlp.iit.tsukuba.ac.jp/must/

  5. For example, (something)-ni-atatte ((something)-dat-run.into) means either “to run into (something)” or “on the occasion of (something),” with the former being the literal interpretation and the latter being the idiomatic interpretation of the compound functional expression.

  6. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T10

  7. http://www.kotoba.nuee.nagoya-u.ac.jp/jc2/kanyo/

  8. http://www.nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

  9. http://www.nlp.kuee.kyoto-u.ac.jp/nl-resource/knp.html

  10. This was done collaboratively; in case of any disagreement on the morpho-syntactic status of an idiom, the two native speakers discussed the case in question and reached a settlement.

  11. Those of the other 250 idioms (27.1%) are infrequent miscellaneous structures like (N-P ((N-P V) (N-P V))) of the idiom ato-ha no-to nare yama-to nare (future-top field-dat become mountain-dat become) “I don’t care what happens afterwards.”

  12. The arrows indicate dependency relations.

  13. Note that some idioms, such as by and large and saba-o yomu (chub.mackerel-acc read) “cheating in counting,” do not have a literal meaning. They are not dealt with in this paper.

  14. It may be difficult to determine some interpretations (literal or idiomatic) and such a decision may only be possible by looking at token usages of candidate phrases. However, such a token usage-based decision for classifying idiom types was not used because of the prohibitive cost involved.

  15. For example, hara-o kimeru (belly-acc decide) “to make up one’s mind” was judged as ambiguous by one of the Group B members. Its literal interpretation would be “decide on which belly to (do something),” which sounds unnatural regardless of the context.

  16. Those of the other nine idioms (6.8%) are infrequent miscellaneous structures like (V-Aux V-Aux) of the idiom nessi-yasuku same-yasui (heat-easy.to cool.down-easy.to) “tend to be enthusiastic (about something) but also tend to be tired (of it).”

  17. The way in which the 90 idioms were selected is described in Sect. 5.2.

  18. For idioms sampled for preliminary annotation, through which the issues of annotation were identified and the specifications of annotation were established, more than 1,000 examples were annotated.

  19. Among the 107,598 examples worked on by the annotators, 258 had been collected because of parser errors and 4,484 lacked sufficient context to interpret the target phrases correctly. Discarding these leaves the 102,856 examples in the corpus. Decisions regarding whether an example should be discarded were made by the annotator who was in charge and one of the authors.

  20. The current release of the corpus, which is now available, is described in Sect. 4.4.

  21. http://www.openmwe.sourceforge.jp/

  22. Although the dictionary has been carefully constructed by hand, the corpus may still contain some problematic examples. The removal of any such examples is the subject of a future project.

  23. http://www.openmwe.sourceforge.jp/cgi-bin/corpus_browser.cgi

  24. http://www.chasen.org/~taku/software/TinySVM/

  25. Bear in mind that HSU implemented them in handcrafted rules, while they were adapted in this study to a machine learning framework.

  26. “Volitional modality” represents verbal expressions of order, request, permission, prohibition, and volition.

  27. The F-Measure of HSU’s baseline system was 0.734.

  28. Note that Japanese is a head final language.

  29. Functional words attached to either the f4 word or the f5 word are ignored. In the example, no (gen) is ignored.

  30. Passivization is indicated by the suffix (r)are in Japanese, but the same suffix is also used for honorification, potentials and spontaneous potentials. These were not distinguished, as doing so is beyond the capabilities of current technology.

  31. Note that f10, f11 and f12 are applied only to those idioms that can be used as predicates.

  32. Ninety examples were unavailable because feature extraction failed. This was caused by KNP’s inability to handle very long sentences; it gives up parsing when the size of the CKY table exceeds a hard-coded threshold. Thus, fewer examples were used for the experiment than were included in the corpus.

  33. The McNemar test was conducted on the ratio of correct and incorrect idiom example classifications between the two groups, “with idiom features” and “without idiom features.” The idiom examples used for the test were all of the data described in Table 5, and thus were identical across the two groups.

  34. For ease of reference, the first row shows the result with all of the idiom features used.

  35. Note that a greater performance drop indicates a greater contribution.

  36. This result is inconsistent with that obtained in HSU, in which it was reported that grammatical constraints involving adnominal modification were most effective. The present study suspects that HSU’s observation is not particularly reliable because only 15 test sentences were considered when investigating the best performing grammatical constraint (Hashimoto et al. 2006a, Sect. 4.3).

  37. It might be argued that different feature sets should have been used for different idioms in order to obtain better results. However, doing this would be unrealistic when dealing with so many more idioms, since it would mean that the best feature sets would need to be carefully examined for each idiom.
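
Note 33 mentions a McNemar test on paired correct/incorrect classifications from the two systems. As a rough illustration only, the Python sketch below computes a continuity-corrected chi-square version of that test from two lists of per-example correctness flags. The paper does not state which variant of the test was used, so the chi-square approximation, the function name mcnemar_test, and the toy data are all assumptions and do not reproduce the authors' procedure or results.

```python
import math


def mcnemar_test(correct_a, correct_b):
    """Continuity-corrected McNemar test on paired correctness flags.

    correct_a[i] and correct_b[i] say whether system A and system B
    classified the same example i correctly (hypothetical inputs).
    Returns (b, c, statistic, p_value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")

    # Discordant pairs: only examples where the two systems disagree matter.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)

    if b + c == 0:
        # No disagreement at all: no evidence of a difference.
        return b, c, 0.0, 1.0

    # Chi-square statistic with continuity correction (clamped at zero).
    statistic = max(abs(b - c) - 1, 0) ** 2 / (b + c)

    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return b, c, statistic, p_value


# Toy data, not taken from the paper.
with_idiom_features = [True, True, True, True, True, True, False, False]
without_idiom_features = [True, True, False, False, False, False, True, False]

b, c, statistic, p_value = mcnemar_test(with_idiom_features, without_idiom_features)
print(f"b={b}, c={c}, statistic={statistic:.3f}, p={p_value:.3f}")
```

With very few discordant pairs, an exact binomial version of the test is usually preferred over the chi-square approximation used in this sketch.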

References

  • Baldwin, T., Bannard, C., Tanaka, T., & Widdows, D. (2003). An empirical model of multiword expression decomposability. In Proceedings of the workshop on multiword expressions: Analysis, acquisition and treatment, pp. 89–96.

  • Birke, J., & Sarkar, A. (2006). A clustering approach for the nearly unsupervised recognition of nonliteral language. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics (EACL 2006), pp. 329–336.

  • Cook, P., Fazly, A., & Stevenson, S. (2007). Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the workshop on a broader perspective on multiword expressions, pp. 41–48.

  • Cook, P., Fazly, A., & Stevenson, S. (2008). The VNC-tokens dataset. In Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE2008), pp. 19–22.

  • Edmonds, P., & Cotton, S. (2001). SENSEVAL-2: Overview. In Proceedings of the second international workshop on evaluating word sense disambiguation systems (SENSEVAL-2), pp. 1–5.

  • Fazly, A., & Stevenson, S. (2006). Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics (EACL-2006), pp. 337–344.

  • Grégoire, N., Evert, S., & Kim, S. N. (Eds.) (2007). Proceedings of the workshop on a broader perspective on multiword expressions. Prague: Association for Computational Linguistics.

  • Grégoire, N., Evert, S., & Krenn, B. (Eds.) (2008). Proceedings of the LREC workshop towards a shared task for multiword expressions. Marrakech: ACL Special Interest Group on the Lexicon (SIGLEX).

  • Hashimoto, C., & Kawahara, D. (2008). Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the conference on empirical methods in natural language processing 2008 (EMNLP-2008), pp. 991–1000.

  • Hashimoto, C., & Kurohashi, S. (2007). Construction of domain dictionary for fundamental vocabulary. In Proceedings of the 45th annual meeting of the association for computational linguistics (ACL’07) Poster, pp. 137–140.

  • Hashimoto, C., & Kurohashi, S. (2008). Blog categorization exploiting domain dictionary and dynamically estimated domains of unknown words. In Proceedings of the 46th annual meeting of the association for computational linguistics (ACL’08) Short paper, Poster, pp. 69–72.

  • Hashimoto, C., Sato, S., & Utsuro, T. (2006a). Detecting Japanese idioms with a linguistically rich dictionary. Language Resources and Evaluation 40(3–4), 243–252.

  • Hashimoto, C., Sato, S., & Utsuro, T. (2006b). Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In The joint 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (COLING/ACL 2006) Poster. Sydney, pp. 353–360.

  • Isahara, H., Bond, F., Uchimoto, K., Utiyama, M., & Kanzaki, K. (2008). Development of the Japanese WordNet. In The sixth international conference on language resources and evaluation (LREC2008).

  • Ishida, P. (2000). Doushi Kanyouku-ni taisuru Tougoteki Sousa-no Kaisou Kankei (On the Hierarchy of Syntactic Operations Applicable to Verb Idioms). Nihongo Kagaku (Japanese Linguistics) 7, 24–43.

  • Katz, G., & Giesbrecht, E. (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the workshop, COLING/ACL 2006, multiword expressions: Identifying and exploiting underlying properties, pp. 12–19.

  • Kawahara, D., & Kurohashi, S. (2006). Case frame compilation from the Web using high-performance computing. In Proceedings of the 5th international conference on language resources and evaluation (LREC-06), pp. 1344–1347.

  • Kilgarriff, A., & Palmer, M. (2000). Introduction to the special issue on SENSEVAL. Computers and the Humanities 34(1–2), 1–13.

  • Kindaichi, H. (2005). Shogakusei no Manga Kanyouku Jiten (Comic dictionary of idioms for elementary school children). Shogakukan.

  • Kindaichi, H., & Kindaichi, H. (2005). Shin Reinbo Shogaku Kokugo Jiten (New Rainbow Japanese dictionary for elementary school). Gakken.

  • Kindaichi, K. (2006). Shogakukan Gakushu Kokugo Shin Jiten Zentei Dainihan (Shogakukan’s Japanese new dictionary for learners, 2nd edn). Shogakukan.

  • Krenn, B., & Evert, S. (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the workshop on collocations, pp. 39–46.

  • Kuiper, K., McCann, H., Quinn, H., Aitchison, T., & van der Veer, K. (2003). SAID: A syntactically annotated idiom dataset. Linguistic Data Consortium, LDC2003T10. Pennsylvania.

  • Kurohashi, S., & Nagao, M. (1994). A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Computational Linguistics 20(4), 507–534.

  • Kurohashi, S., Nakamura, T., Matsumoto, Y., & Nagao, M. (1994). Improvements of Japanese morphological analyzer JUMAN. In Proceedings of the international workshop on sharable natural language resources, pp. 22–28.

  • Lee, Y. K., & Ng, H. T. (2002). An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In EMNLP ’02: Proceedings of the ACL-02 conference on empirical methods in natural language processing, pp. 41–48.

  • Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th annual meeting of the association for computational linguistics, pp. 317–324.

  • Magnini, B., Strapparava, C., Pezzulo, G., & Gliozzo, A. (2002). The role of domain information in word sense disambiguation. Natural Language Engineering, Special Issue on Word Sense Disambiguation, 8(3), 359–373.

  • Miyaji, Y. (1982). Usage and semantics of idioms. Meiji Shoin. (in Japanese).

  • Moirón, B. V., Villavicencio, A., McCarthy, D., Evert, S., & Stevenson, S. (Eds.) (2006). Proceedings of the workshop on multiword expressions: Identifying and exploiting underlying properties. Sydney, Australia: Association for Computational Linguistics.

  • Morita, Y. (1985). Doushi Kanyouku (Verb Idioms). Nihongogaku (Japanese Linguistics) 4(1), 37–44.

  • Rayson, P., Moirón, B. V., Sharoff, S., Piao, S., & Evert, S. (Eds.) (2008). International Journal of Language Resources and Evaluation. Springer (Special issue on Multiword expressions: hard going or plain sailing?)

  • Rayson, P., Sharoff, S., & Adolphs, S. (Eds.) (2006). Proceedings of EACL 2006 workshop on multi-word-expressions in a multilingual context. Trento, Italy: European Chapter of the Association for Computational Linguistics.

  • Sato, S. (2007). Compilation of a comparative list of basic Japanese idioms from five sources. In IPSJ 2007-NL-178, pp. 1–6. (in Japanese).

  • Shudo, K., Tanabe, T., Takahashi, M., & Yoshimura, K. (2004). MWEs as non-propositional content indicators. In The 2nd ACL workshop on multiword expressions: Integrating processing, pp. 32–39.

  • Takahashi, T., Soonsang, H., Taura, K., & Yonezawa, A. (2002). World Wide Web Crawler. In Poster proceedings of the 11th international World Wide Web conference.

  • Tanaka, T., Bond, F., Baldwin, T., Fujita, S., & Hashimoto, C. (2007). Word sense disambiguation incorporating lexical and structural semantic information. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pp. 477–485.

  • Tsuchiya, M., Utsuro, T., Matsuyoshi, S., Sato, S., & Nakagawa, S. (2006). Development and analysis of an example database of Japanese compound functional expressions. Transactions of Information Processing Society of Japan 47(6), 1728–1741. (in Japanese).

  • Uchiyama, K., Baldwin, T., & Ishizaki, S. (2005). Disambiguating Japanese compound verbs. Computer Speech and Language, Special Issue on Multiword Expressions 19(4), 497–512.

  • Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.

  • Villavicencio, A., Bond, F., Korhonen, A., & McCarthy, D. (Eds.) (2005). Journal of Computer Speech and Language: Special Issue on Multiword Expressions. Elsevier.

  • Yonekawa, A., & Ohtani, I. (2005). Nihongo Kanyouku Jiten (Japanese idiom dictionary). Tokyo-do Shuppan.

Acknowledgments

This work was conducted as part of the collaborative research project of Kyoto University and NTT Communication Science Laboratories. The work was supported by NTT Communication Science Laboratories and JSPS Grants-in-Aid for Young Scientists (B) 19700141. We would like to thank the members of the collaborative research group of Kyoto University and NTT Communication Science Laboratories and Dr. Francis Bond for their stimulating discussion. Thanks are also due to Prof. Satoshi Sato, who kindly provided us with the list of basic Japanese idioms.

Author information

Corresponding author

Correspondence to Chikara Hashimoto.

About this article

Cite this article

Hashimoto, C., Kawahara, D. Compilation of an idiom example database for supervised idiom identification. Lang Resources & Evaluation 43, 355–384 (2009). https://doi.org/10.1007/s10579-009-9104-1
