
Abstract

Many syntactic treebanks and parser toolkits have been developed over the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually use different phrase tagsets for different languages, which is inconvenient for multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. It also provides mappings from existing treebank tagsets to the universal tagset, covering 25 constituent treebanks and 21 languages. Experiments show that the universal phrase tagset can generally reduce the cost of the parsing models and even improve parsing accuracy.
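To make the idea of a tagset mapping concrete, the following is a minimal sketch of how treebank-specific phrase labels might be rewritten to a shared label set. The mapping table, the universal label names (`AJP`, `AVP`, `X`), and the `relabel` helper are all hypothetical illustrations, not the paper's actual tagset or mapping tables, and the input is assumed to be a Penn-style bracketed tree string.

```python
import re

# Hypothetical, partial mapping from Penn Treebank phrase labels to
# illustrative universal labels. This is NOT the paper's mapping table.
PTB_TO_UNIVERSAL = {
    "NP": "NP",
    "WHNP": "NP",
    "VP": "VP",
    "ADJP": "AJP",
    "ADVP": "AVP",
    "PP": "PP",
    "S": "S",
    "SBAR": "S",
}
FALLBACK = "X"  # assumed catch-all label for unmapped phrase categories

# A phrase label is the token after "(" whose first child is another
# bracket; preterminals such as "(NN dog)" are left untouched.
PHRASE_LABEL = re.compile(r"\(([^\s()]+)(?=\s*\()")


def relabel(tree: str, mapping: dict, fallback: str = FALLBACK) -> str:
    """Rewrite the phrase labels of a bracketed tree using `mapping`."""
    def swap(match: re.Match) -> str:
        # Drop function tags such as "-SBJ" before the lookup.
        base = re.split(r"[-=]", match.group(1))[0]
        return "(" + mapping.get(base, fallback)

    return PHRASE_LABEL.sub(swap, tree)


source = "(S (NP-SBJ (DT the) (NN dog)) (VP (VBZ barks) (ADVP (RB loudly))))"
print(relabel(source, PTB_TO_UNIVERSAL))
```

With one such table per treebank, trees from different languages end up sharing a single phrase label inventory, which is the property a multilingual parsing experiment relies on.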

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Han, A.L.-F., Wong, D.F., Chao, L.S., Lu, Y., He, L., Tian, L. (2014). A Universal Phrase Tagset for Multilingual Treebanks. In: Sun, M., Liu, Y., Zhao, J. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2014, CCL 2014. Lecture Notes in Computer Science, vol 8801. Springer, Cham. https://doi.org/10.1007/978-3-319-12277-9_22

  • DOI: https://doi.org/10.1007/978-3-319-12277-9_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12276-2

  • Online ISBN: 978-3-319-12277-9

  • eBook Packages: Computer Science, Computer Science (R0)
