Abstract
In this paper, we propose a segment-based annotation tool that provides appropriate interactivity between a human annotator and an automatic parser. The tool presents a preview of the complete sentence structure suggested by the parser and updates the preview whenever the annotator selects or cancels a segmentation point. The annotator can thus choose the sentence segments that maximize parsing accuracy while minimizing human intervention. Experimental results show that the proposed tool reduces human intervention by approximately 39% compared with manual annotation. The Sejong Korean treebank, a large-scale treebank, was constructed with the proposed annotation tool.
Notes
The segmentation model, \(\operatorname*{argmax}_{s_{1n}} \prod_{i=0}^{n} P(s_{i} \mid t_{i}, t_{i+1})\), achieves 81.31% precision and 64.62% recall. Precision is the ratio of correct candidate segmentation points ‘)(’ to all candidate segmentation points ‘)(’ generated by the parsing model; recall is the ratio of correct candidate segmentation points ‘)(’ to all correct segmentation points ‘)(’ in the test set of the treebank.
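The note above can be illustrated with a minimal sketch. It assumes that each \(s_{i}\) is a binary decision (segmentation point or not) and that the probabilities are available as a lookup table keyed by adjacent tag pairs; neither assumption is stated in the note, and the names `best_segmentation`, `precision_recall`, and `probs` are hypothetical. Because the product factorizes over positions, maximizing it reduces to picking the more probable value of each \(s_{i}\) independently.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical table: P(s_i = 1 | t_i, t_{i+1}), the probability that a
# segmentation point falls between two adjacent tags. Assumed, not from the paper.
BoundaryProbs = Dict[Tuple[str, str], float]


def best_segmentation(tags: List[str], probs: BoundaryProbs,
                      default: float = 0.0) -> List[int]:
    """Return one binary decision per adjacent tag pair.

    Each factor of the product depends on a single s_i, so the argmax
    is obtained by choosing s_i = 1 exactly when P(s_i = 1 | ...) > 0.5.
    Callers may pad `tags` with boundary symbols to cover i = 0 and i = n.
    """
    decisions = []
    for left, right in zip(tags, tags[1:]):
        p_boundary = probs.get((left, right), default)
        decisions.append(1 if p_boundary > 0.5 else 0)
    return decisions


def precision_recall(predicted: Set[int], gold: Set[int]) -> Tuple[float, float]:
    """Precision and recall over sets of segmentation-point positions."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    toy_probs = {("VP", "NP"): 0.9, ("NP", "VP"): 0.2}  # made-up values
    print(best_segmentation(["NP", "VP", "NP", "VP"], toy_probs))
    print(precision_recall({1, 3}, {1, 2, 4}))
```

The toy probabilities and positions are invented for illustration only and do not reproduce the reported 81.31% precision or 64.62% recall.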
Acknowledgments
This work was supported partly by grant R01-2006-000-11162-0 from the Korea Science & Engineering Foundation’s Basic Research Program and partly by the second stage of the BK-21 project.
Cite this article
Park, SY., Song, YI. & Rim, HC. A segment-based annotation tool for Korean treebanks with minimal human intervention. Lang Resources & Evaluation 40, 281–289 (2006). https://doi.org/10.1007/s10579-007-9029-5
DOI: https://doi.org/10.1007/s10579-007-9029-5