A Universal Phrase Tagset for Multilingual Treebanks

Han, Aaron Li-Feng; Wong, Derek F.; Chao, Lidia S.; Lu, Yi; He, Liangye; Tian, Liang

doi:10.1007/978-3-319-12277-9_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8801)

Abstract

Many syntactic treebanks and parser toolkits have been developed in the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually use different phrase tagsets for different languages, which is inconvenient for multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, a mapping from existing treebank tagsets to the universal tagset is provided, covering 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the costs in the parsing models and even improve the parsing accuracy.
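
As an illustration of what a phrase tagset mapping involves, the following Python sketch relabels the phrase nodes of a Penn-style bracketed tree with a coarser label set. The mapping table, the coarse label names (AJP, AVP, X and so on), and the function name relabel are assumptions made for illustration only; the abstract does not list the nine categories, so this should not be read as the paper's actual mapping.

```python
import re

# Hypothetical mapping from Penn Treebank phrase labels to a coarse label set.
# The right-hand labels are illustrative assumptions, not the paper's tagset.
PTB_TO_UNIVERSAL = {
    "NP": "NP", "WHNP": "NP", "NX": "NP",
    "VP": "VP",
    "ADJP": "AJP", "WHADJP": "AJP",
    "ADVP": "AVP", "WHADVP": "AVP",
    "PP": "PP", "WHPP": "PP",
    "S": "S", "SBAR": "S", "SINV": "S", "SQ": "S", "SBARQ": "S",
    "CONJP": "CONJP",
}
FALLBACK = "X"  # assumed catch-all label for anything not in the table

# A phrase node is an opening bracket and label followed by another opening
# bracket. Preterminals such as "(DT the)" are followed by a word, so they
# are not matched and part-of-speech tags stay untouched.
_PHRASE_LABEL = re.compile(r"\(([^\s()]+)(?=\s+\()")


def relabel(tree: str) -> str:
    """Replace every phrase label in a bracketed tree with its coarse label."""
    def _map(match: re.Match) -> str:
        label = match.group(1)
        # Drop function tags and indices, e.g. "NP-SBJ" or "NP=2" -> "NP".
        base = re.split(r"[-=]", label)[0] or label
        return "(" + PTB_TO_UNIVERSAL.get(base, FALLBACK)

    return _PHRASE_LABEL.sub(_map, tree)


if __name__ == "__main__":
    print(relabel("(S (NP-SBJ (DT the) (NN cat)) (VP (VBD sat) (ADVP (RB quietly))))"))
```

Keeping the table as plain data means one such dictionary could be written per treebank while the relabeling function stays unchanged, which is the practical appeal of a shared tagset for multilingual experiments.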

Author information

Authors and Affiliations

NLP2CT lab, Department of CIS, University of Macau, China
Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yi Lu, Liangye He & Liang Tian
ILLC, University of Amsterdam, The Netherlands
Aaron Li-Feng Han

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Haidian District, 100084, Beijing, China
Maosong Sun & Yang Liu
Chinese Academy of Sciences, Institute of Automation, 100190, Beijing, China
Jun Zhao

About this paper

Cite this paper

Han, A.L.-F., Wong, D.F., Chao, L.S., Lu, Y., He, L., Tian, L. (2014). A Universal Phrase Tagset for Multilingual Treebanks. In: Sun, M., Liu, Y., Zhao, J. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2014, CCL 2014. Lecture Notes in Computer Science (LNAI), vol 8801. Springer, Cham. https://doi.org/10.1007/978-3-319-12277-9_22

DOI: https://doi.org/10.1007/978-3-319-12277-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12276-2
Online ISBN: 978-3-319-12277-9
eBook Packages: Computer Science, Computer Science (R0)
