Ensuring annotation consistency and accuracy for Vietnamese treebank

  • Original Paper
  • Language Resources and Evaluation

Abstract

Treebanks are important resources for researchers in natural language processing. They provide training and testing materials so that different algorithms can be compared. However, constructing a high-quality treebank is not a trivial task. A low-resource language such as Vietnamese has not yet had a proper treebank, which has probably lowered the performance of Vietnamese language processing. We have been building a consistent and accurate Vietnamese treebank to address this gap. Our treebank is annotated with three layers: word segmentation, part-of-speech tagging, and bracketing. We developed detailed annotation guidelines for each layer by presenting Vietnamese linguistic issues as well as methods of addressing them. Here, we also describe approaches to controlling annotation quality while ensuring a reasonable annotation speed. Specifically, we designed an appropriate annotation process and an effective process for training annotators. In addition, we implemented several support tools to improve annotation speed and to control the consistency of the treebank. Experimental results revealed that both inter-annotator agreement and accuracy were higher than 90%, which indicates that the treebank is reliable.

Notes

  1. Underscore “_” is used to link syllables of Vietnamese multi-syllabic words. English translations of Vietnamese words have been provided as subscripts. If a Vietnamese word does not have a translatable meaning, the subscript is blank. The translation of the Vietnamese sentence is provided in braces below the original text.

  2. In this example, cái is a classifier noun in Vietnamese. Classifier nouns indicate two types of entities: animate and inanimate things.

  3. Categorization nouns indicate general entities, such as cá \(_{fish}\) and cây \(_{tree}\).

  4. http://www.tuoitre.com.

  5. http://www.thanhnien.vn.

  6. Types of words are presented in Sect. 4.1.

  7. Word-internal structures are presented in Sect. 4.3.4.

  8. Free syllables are those having either a lexical or a functional meaning. A free syllable can stand alone as a word.

  9. A bound syllable is a syllable that cannot stand alone as a single-syllabic word. A bound syllable does not necessarily have meaning. It always combines with other syllables or words to create a compound word.

  10. Without loss of generality, we assume the expression we want to segment is A B, where A and B can be syllables or words.

  11. Special classifier nouns, e.g., sự \(_{-ing/-ion/-ity/...}\), việc \(_{-ing/-ion/-ity/...}\), and nhà \(_{-er/-or}\), have been considered to be classifier nouns by Nguyen et al. (2015). However, the collocation of special classifier nouns is different from that of classifier nouns such as cái and con. Classifier nouns are placed before categorization nouns to indicate animate (con cá \(_{fish}\)) and inanimate entities (cái bàn \(_{table}\)). Special classifier nouns, by contrast, play the same role as affixes in English: they can combine with a verb, such as việc \(_{-ion}\) lựa_chọn \(_{\textit{to\, select}}\) {selection}, or an adjective, such as sự \(_{-ity}\) nhập_nhằng \(_{ambiguous}\) {ambiguity}.

  12. The collocations of a word express its ability to combine with other words. For example, the demonstrative pronoun này \(_{this/these}\) can collocate with a noun (e.g., cây \(_{tree}\) này \(_{this}\) {this tree}). However, personal pronouns, such as chúng_tôi \(_{we/us}\) and họ \(_{they/them}\), do not have these collocations.

  13. Syntactic functions of a word denote the syntactic roles of the word in phrases (such as the head of a phrase or a modifier) and in sentences.

  14. Việc is a special classifier noun that is understood as -ion, -ment, -ing, -ity, or -ness when it precedes verbs or adjectives. A combined expression of the special classifier noun việc and a verb or an adjective is understood to be a noun in English. For example, học_tập means to learn; hence, we can use việc học_tập to express learning.

  15. Đang is an adjunct used to express continuation. For example, sinh_sống means to live. To express to be living, we use đang sinh_sống.

  16. Demonstrative pronouns in Vietnamese, such as này \(_{this/these}\), đó \(_{that/those}\), ấy \(_{that/those}\), and kia \(_{that/those}\), play the same roles as demonstrative adjectives in English, but they are placed at the end of noun phrases.

  17. Preprocessing steps included cleaning data, topic classification, sentence segmentation, and annotation by using automatic tools.

  18. http://chasen.org/taku/software/yamcha/.
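The underscore convention in note 1 can be illustrated with a minimal sketch. It assumes a segmented sentence is stored as plain text in which words are separated by spaces and the syllables of a multi-syllabic word are joined by "_"; the function name `split_words` is hypothetical and is not taken from the authors' tools.

```python
from typing import List


def split_words(segmented: str) -> List[List[str]]:
    """Return one list of syllables per word of a segmented sentence.

    Assumes words are separated by whitespace and syllables inside a
    multi-syllabic word are linked by underscores, as in note 1.
    """
    words = []
    for token in segmented.split():
        # Drop empty pieces so stray underscores do not create empty syllables.
        syllables = [s for s in token.split("_") if s]
        if syllables:
            words.append(syllables)
    return words


if __name__ == "__main__":
    # Example phrase taken from note 14: việc + học_tập.
    for syllables in split_words("việc học_tập"):
        kind = "multi-syllabic" if len(syllables) > 1 else "single-syllabic"
        print("_".join(syllables), kind)
```

Keeping each word as a list of syllables makes it easy to count single-syllabic and multi-syllabic words separately, which is the distinction that notes 8 and 9 rely on.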

References

  • Abeillé, A., Clément, L., & Toussenel, F. (2003). Building a treebank for French. In Treebanks (pp. 165–187). New York: Springer.

  • Allauzen, A., Aufrant, L., Burlot, F., Knyazeva, E., Lavergne, T., & Yvon, F. (2016). Limsi@ wmt’16: Machine translation of news. In Proceedings of the first conference on machine translation (pp. 239–245). Association for Computational Linguistics.

  • Barr, C., Jones, R., & Regelson, M. (2008). The linguistic structure of English web-search queries. In Proceedings of the conference on empirical methods in natural language processing (pp. 1021–1030). Association for Computational Linguistics.

  • Bies, A., Ferguson, M., Katz, K., MacIntyre, R., Tredinnick, V., Kim, G., et al. (1995). Bracketing guidelines for Treebank II style Penn Treebank project. Philadelphia: University of Pennsylvania.

  • Cai, J., Utiyama, M., Sumita, E., & Zhang, Y. (2014). Dependency-based pre-ordering for Chinese–English machine translation. In Proceedings of the 52nd annual meeting of the association for computational linguistics (pp. 155–160). Association for Computational Linguistics.

  • Chang, P. C., Galley, M., & Manning, C. D. (2008). Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the third workshop on statistical machine translation (pp. 224–232). Association for Computational Linguistics.

  • Chinkina, M., Kannan, M., & Meurers, D. (2016). Online information retrieval for language learning. In Proceedings of the 54th annual meeting of the association for computational linguistics-system demonstrations (pp. 7–12).

  • Corp, D.C.S. LacViet. (2011). Vietnamese dictionary. LacViet Corp.

  • Diep, Q.-B. (2005). Vietnamese grammar. Ha Noi: Vietnam Education Publisher.

  • Dinh, D., & Vu, T. (2006). A maximum entropy approach for Vietnamese word segmentation. In Proceedings of research, innovation and vision for the future in computing and communication technologies (pp. 248–253). IEEE.

  • Di Sciullo, A. M., & Williams, E. (1987). On the definition of word (Vol. 14). New York: Springer.

  • Fang, A. C., & Cao, J. (2010). Enhanced genre classification through linguistically fine-grained POS tags. In Proceedings of PACLIC (pp. 85–94).

  • Galitsky, B., Ilvovsky, D. I., Kuznetsov, S. O., & Strok, F. (2013). Matching sets of parse trees for answering multi-sentence questions. In Proceedings of RANLP (pp. 285–293).

  • Han, C. H., Han, N. R., Ko, E. S., & Palmer, M. (2002). Development and evaluation of a Korean treebank and its application to NLP. In Proceedings of the 3rd international conference on language resources and evaluation (LREC-2002) (pp. 1635–1642).

  • Hoang, P. (1998). Vietnamese dictionary. Singapore: Scientific & Technical Publishing.

  • Hoshino, S., Miyao, Y., Sudoh, K., Hayashi, K., & Nagata, M. (2015). Discriminative preordering meets Kendall’s tau maximization. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (short papers) (pp. 139–144). Association for Computational Linguistics.

  • Jijkoun, V., De Rijke, M., & Mur, J. (2004). Information extraction for question answering: Improving recall through syntactic patterns. In Proceedings of the 20th international conference on computational linguistics (pp. 1284). Association for Computational Linguistics.

  • Katz-Brown, J., Petrov, S., McDonald, R., Och, F., Talbot, D., Ichikawa, H., Seno, M., & Kazawa, H. (2011). Training a parser for machine translation reordering. In Proceedings of the conference on empirical methods in natural language processing (pp. 183–192). Association for Computational Linguistics.

  • Le, H. P., Nguyen, T. M. H., & Roussanaly, A. (2012). Vietnamese parsing with an automatically extracted tree-adjoining grammar. In Proceedings of research, innovation and vision for the future in computing and communication technologies (RIVF) (pp. 1–6). IEEE.

  • Le, A. C., Nguyen, P. T., Vuong, H. T., Pham, M. T., & Ho, T. B. (2009). An experimental study on lexicalized statistical parsing for Vietnamese. In Proceedings of knowledge and systems engineering (pp. 162–167). IEEE.

  • Le-Hong, P., Nguyen, T. M. H., Roussanaly, A., & Ho, T. V. (2008). A hybrid approach to word segmentation of Vietnamese texts. In Proceedings of the 2nd international conference on language and automata theory and applications.

  • Le-Hong, P., Roussanaly, A., Nguyen, T. M. H., & Rossignol, M. (2010). An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts. In Traitement Automatique des Langues Naturelles-taln 2010 (pp. 12).

  • Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

  • Miyao, Y., & Tsujii, J. (2008). Feature forest models for probabilistic HPSG parsing. Computational Linguistics, 34(1), 35–80.

  • Nghiem, M., Dinh, D., & Nguyen, M. (2008). Improving Vietnamese POS tagging by integrating a rich feature set and support vector machines. In Proceedings of research, innovation and vision for the future in computing and communication technologies (RIVF) (pp. 128–133). IEEE.

  • Nguyen, T. M. H., Hoang, T. T. L., & Vu, X. L. (2010). Vietnamese word segmentation guidelines. Technical report sp 8.2. Ministry of Education and Training (Vietnam).

  • Nguyen, Q. T., Miyao, Y., Le, H. T. T., & Nguyen, N. L. T. (2016). Challenges and solutions for consistent annotation of Vietnamese treebank. In Proceedings of the language resources and evaluation conference.

  • Nguyen, Q. T., Nguyen, N. L. T., & Miyao, Y. (2012). Comparing different criteria for Vietnamese word segmentation. In Proceedings of 3rd workshop on South and Southeast Asian natural language processing (SANLP) (pp. 53–68). Citeseer.

  • Nguyen, Q. T., Nguyen, N. L. T., & Miyao, Y. (2013). Utilizing state-of-the-art parsers to diagnose problems in treebank annotation for a less resourced language. In Proceedings of the 7th linguistic annotation workshop & interoperability with discourse (pp. 19–27). Association for Computational Linguistics.

  • Nguyen, Q. D., Nguyen, Q. D., Pham, B. S., Nguyen, P. T., & Nguyen, L. M. (2014). From treebank conversion to automatic dependency parsing for Vietnamese. In Natural language processing and information systems (pp. 196–207). New York: Springer.

  • Nguyen, P. T., Le, A. C., Ho, T. B., & Nguyen, V. H. (2015). Vietnamese treebank construction and entropy-based error detection. Language Resources and Evaluation, 49(3), 487–519.

  • Nguyen, P. T., Vu, X. L., & Nguyen, T. M. H. (2010a). Vietnamese part-of-speech tagging guidelines. Technical report sp 7.3. Ministry of Education and Training (Vietnam).

  • Nguyen, P. T., Vu, X. L., Nguyen, T. M. H., Nguyen, V. H., & Le, H. P. (2009). Building a large syntactically-annotated corpus of Vietnamese. In Proceedings of the third linguistic annotation workshop (pp. 182–185). Association for Computational Linguistics.

  • Nguyen, P. T., Vu, X. L., Nguyen, T. M. H., Dao, M. T., Dao, T. M. N., & Le, K. N. (2010b). Vietnamese bracketing guidelines. Technical report sp7.3. Ministry of Education and Training (Vietnam).

  • Peng, F., & Huang, X. (2007). Machine learning for Asian language text classification. Journal of Documentation, 63(3), 378–397.

  • Petrov, S., Barrett, L., Thibaux, R., & Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics (pp. 433–440). Association for Computational Linguistics.

  • Santorini, B. (1990). Part-of-speech tagging guidelines for the Penn Treebank project. Pennsylvania: University of Pennsylvania.

  • SCSSV. (1983). Vietnamese grammar. Social Sciences Publishers.

  • Socher, R., Bauer, J., Manning, C. D., & Ng, A. Y. (2013). Parsing with compositional vector grammars. In Proceedings of the ACL conference. Citeseer.

  • Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology (Vol. 1, pp. 173–180). Association for Computational Linguistics.

  • Tsuruoka, Y., Miyao, Y., & Kazama, J. (2011). Learning with lookahead: Can history-based models rival globally optimized models? In Proceedings of the fifteenth conference on computational natural language learning (pp. 238–246). Association for Computational Linguistics.

  • Verberne, S., Boves, L., Oostdijk, N., & Coppen, P. A. (2008). Using syntactic information for improving why-question answering. In Proceedings of the 22nd international conference on computational linguistics (Vol. 1, pp. 953–960). Association for Computational Linguistics.

  • Xia, F. (2000a). The part-of-speech tagging guidelines for the Penn Chinese Treebank (3.0). Technical report IRCS 00-07. University of Pennsylvania.

  • Xia, F. (2000b). The segmentation guidelines for the Penn Chinese Treebank (3.0). Technical report IRCS 00-06. University of Pennsylvania.

  • Xia, F., Palmer, M., Xue, N., Okurowski, M. E., Kovarik, J., Chiou, F. D., Huang, S., Kroch, T., & Marcus, M. P. (2000). Developing guidelines and ensuring consistency for Chinese text annotation. In Proceedings of the second international conference on language resources and evaluation.

  • Xue, N., Xia, F., Chiou, F.-D., & Palmer, M. (2005). The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(02), 207–238.

  • Xue, N., Xia, F., Huang, S., & Kroch, A. (2000). The bracketing guidelines for the Penn Chinese Treebank (3.0). Technical report IRCS 00-08. University of Pennsylvania.

Acknowledgements

We would like to thank Assoc. Prof. Dien Dinh and Dr. Ngan L.T. Nguyen for their comments and the discussions we had with them during the early stages of developing the guidelines. We also would like to thank our annotators for their cooperation.

Author information

Corresponding author

Correspondence to Quy T. Nguyen.

About this article

Cite this article

Nguyen, Q.T., Miyao, Y., Le, H.T.T. et al. Ensuring annotation consistency and accuracy for Vietnamese treebank. Lang Resources & Evaluation 52, 269–315 (2018). https://doi.org/10.1007/s10579-017-9398-3

  • DOI: https://doi.org/10.1007/s10579-017-9398-3
