Abstract
We have constructed a large-scale, detailed database of Japanese lexical types from a treebank that includes detailed linguistic information. The database helps treebank annotators and grammar developers share precise knowledge about the grammatical status of the words that constitute the treebank, allowing for consistent large-scale treebanking and grammar development. In addition, it clarifies, on the basis of the treebank, which lexical types are needed for precise Japanese NLP. In this paper, we report on the motivation and methodology of the database construction.
Notes
Currently, the Hinoki treebank contains about 121,000 sentences (about 10 words per sentence).
We think another snapshot is also needed: that of the grammar rules and principles in use. We do not address it in this paper, however.
These are the actual names of the lexical types implemented in our grammar, and they may not be transparent to general readers.
The object, a conclusion, is expressed by a phonologically null pronoun.
Note that this information is not explicitly stored in the database. Rather, it is dynamically compiled from the database together with a lexicon database, when triggered by a user query. User queries are words like ni.
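The query-time compilation described in the last note can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `LexicalEntry`, `LEXICON`, `TYPE_DB`, and `lookup`, the placeholder type names, and the in-memory storage are hypothetical and do not reflect the actual schema or type inventory of the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    orthography: str   # surface form used as the user query, e.g. "ni"
    lexical_type: str  # name of the lexical type assigned in the lexicon


# Hypothetical lexicon database; the type names are placeholders,
# not the grammar's real lexical type names.
LEXICON = [
    LexicalEntry("ni", "TYPE-A"),
    LexicalEntry("ni", "TYPE-B"),
    LexicalEntry("ga", "TYPE-C"),
]

# Hypothetical lexical type database: type name -> stored documentation.
TYPE_DB = {
    "TYPE-A": "placeholder description of type A",
    "TYPE-B": "placeholder description of type B",
    "TYPE-C": "placeholder description of type C",
}


def lookup(word, lexicon=LEXICON, type_db=TYPE_DB):
    """Compile, at query time, the lexical types a word can belong to.

    Nothing is precomputed per word: the answer is assembled by joining
    the lexicon with the lexical type database when the query arrives.
    """
    results = []
    for entry in lexicon:
        if entry.orthography == word:
            description = type_db.get(entry.lexical_type, "")
            results.append((entry.lexical_type, description))
    return results


if __name__ == "__main__":
    for type_name, description in lookup("ni"):
        print(type_name, "-", description)
```

Joining at query time, rather than storing a per-word list, means the answer stays consistent with both databases whenever either one is updated.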
References
Bond, F., Fujita, S., Hashimoto, C., Nariyama, S., Nichols, E., Ohtani, A., Tanaka, T., & Amano, S. (2004a). The Hinoki Treebank—toward text understanding. In Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora (LINC-04), Geneva, pp. 7–10.
Bond, F., Fujita, S., & Tanaka, T. (2006). The Hinoki syntactic and semantic treebank of Japanese. Language Resources and Evaluation, 40(3–4), 253–261.
Bond, F., Nichols, E., Fujita, S., & Tanaka, T. (2004b). Acquiring an Ontology for a Fundamental Vocabulary. In 20th International Conference on Computational Linguistics (COLING-2004), Geneva, pp. 1319–1325.
Breen, J. W. (2004). JMDict: A Japanese-multilingual dictionary. In Coling 2004 Workshop on Multilingual Linguistic Resources, Geneva, pp. 71–78.
Dini, L., & Mazzini, G. (1997). Hypertextual grammar development. In Computational Environments for Grammar Development and Linguistic Engineering, Madrid, pp. 24–29.
Ikehara, S., Shirai, S., Yokoo, A., & Nakaiwa, H. (1991). Toward an MT system without pre-editing—Effects of new methods in ALT-J/E–. In Third Machine Translation Summit: MT Summit III. Washington, DC, pp. 101–106. (http://xxx.lanl.gov/abs/cmp-lg/9510008).
Kurohashi, S., & Nagao, M. (2003). Building a Japanese parsed corpus. In A. Abeille (Ed.), Treebanks: Building and using parsed corpora (Chap. 14, pp. 249–260). Kluwer Academic Publishers.
Matsumoto, Y., Kitauchi, A., Yamashita, T., Hirano, Y., Matsuda, H., Takaoka, K., & Asahara, M. (2000). Morphological analysis system ChaSen version 2.2.1 manual. Nara Institute of Science and Technology.
Miyazaki, M., Shirai, S., & Ikehara, S. (1995). Gengo katēsetsu-ni motozuku nihongo hinshi-no taikēka-to sono kōyō [A Japanese syntactic category system based on the constructive process theory and its use]. Journal of Natural Language Processing, 2(3), 3–25 (in Japanese).
Oepen, S., Flickinger, D., Toutanova, K., & Manning, C. D. (2002). LinGO Redwoods: A rich and dynamic treebank for HPSG. In Proceedings of The First Workshop on Treebanks and Linguistic Theories, Sozopol, Bulgaria, pp. 139–149.
Ohara, K. H., Fujii, S., Ohori, T., Suzuki, R., Saito, H., & Ishizaki, S. (2004). The Japanese FrameNet Project: An introduction. In Proceedings of the LREC-2004 Satellite Workshop Building Lexical Resources from Semantically Annotated Corpora, pp. 9–11.
Siegel, M. (2006). JACY, A grammar for annotating syntax, semantics and pragmatics of written and spoken Japanese for NLP application purposes, Habilitation thesis.
Siegel, M., & Bender, E. M. (2002). Efficient deep processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization. Taipei, Taiwan.
Takeuchi, K., Inui, K., & Fujita, A. (2006). Description of syntactic and semantic characteristics of Japanese verbs based on lexical conceptual structure. In Lexicon Forum, Vol. 2, Hituzi Syobou, pp. 85–120 (in Japanese).
Toutanova, K., Manning, C. D., Flickinger, D., & Oepen, S. (2005). Stochastic HPSG Parse disambiguation using the Redwoods corpus. Research on Language and Computation, 3(1), 83–105.
Tsuchiya, M., Utsuro, T., Matsuyoshi, S., Sato, S., & Nakagawa, S. (2005). A corpus for classifying usages of Japanese compound functional expressions. In Proceedings of Pacific Association for Computational Linguistics 2005. Tokyo, Japan.
Acknowledgements
We would like to thank the other members of the NTT Natural Language Group, Dan Flickinger, Stephan Oepen, and Jason Katz-Brown for their stimulating discussions.
About this article
Cite this article
Hashimoto, C., Bond, F., Tanaka, T. et al. Semi-automatic documentation of an implemented linguistic grammar augmented with a treebank. Lang Resources & Evaluation 42, 117–126 (2008). https://doi.org/10.1007/s10579-008-9065-9
DOI: https://doi.org/10.1007/s10579-008-9065-9