Abstract
Many syntactic treebanks and parser toolkits have been developed over the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually use different phrase tagsets for different languages, which is inconvenient for multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, the mapping to this tagset covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the cost of the parsing models and even improve parsing accuracy.
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Han, A.L.-F., Wong, D.F., Chao, L.S., Lu, Y., He, L., Tian, L. (2014). A Universal Phrase Tagset for Multilingual Treebanks. In: Sun, M., Liu, Y., Zhao, J. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2014. Lecture Notes in Computer Science, vol 8801. Springer, Cham. https://doi.org/10.1007/978-3-319-12277-9_22
DOI: https://doi.org/10.1007/978-3-319-12277-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12276-2
Online ISBN: 978-3-319-12277-9
eBook Packages: Computer Science, Computer Science (R0)