
A feature-based approach to better automatic treebank conversion

  • Original Paper
Language Resources and Evaluation

Abstract

In constituency parsing, there are multiple human-labeled treebanks that are built on non-overlapping text samples and follow different annotation standards. Because manual annotation of parse trees is extremely costly, it is desirable to automatically convert one treebank (the source treebank) to the standard of another treebank of interest (the target treebank). Conversion results can be manually corrected to obtain higher-quality annotations, or used directly as additional training data for building syntactic parsers. To perform automatic treebank conversion, we divide constituency parses into two levels, part-of-speech (POS) tags and syntactic structures (bracketing structures and constituent labels), and convert each level separately with a feature-based approach. The basic idea is to encode the original annotations in the source treebank as guide features during the conversion process. Experiments on two Chinese treebanks show that our approach converts POS tags and syntactic structures with accuracies of 96.6% and 84.8%, respectively, the best reported results on this task.
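The guide-feature idea for the POS level can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pos_conversion_features`, the feature templates, and the source-standard tag names (`r`, `u`, `n`) are all hypothetical, chosen only to show how source annotations might sit next to ordinary tagging features when predicting a target-standard tag.

```python
from typing import List


def pos_conversion_features(words: List[str], source_tags: List[str], i: int) -> List[str]:
    """Build features for predicting the target-standard POS tag of words[i].

    Hypothetical feature templates, for illustration only.
    """
    def at(seq: List[str], j: int) -> str:
        # Return a padding symbol for positions outside the sentence.
        return seq[j] if 0 <= j < len(seq) else "<PAD>"

    # Ordinary tagging features drawn from the sentence itself.
    base = [
        f"w0={at(words, i)}",
        f"w-1={at(words, i - 1)}",
        f"w+1={at(words, i + 1)}",
    ]
    # Guide features: the source treebank's annotation of the same tokens.
    guide = [
        f"src0={at(source_tags, i)}",
        f"src-1,src0={at(source_tags, i - 1)}|{at(source_tags, i)}",
        f"w0,src0={at(words, i)}|{at(source_tags, i)}",
    ]
    return base + guide


words = ["我", "的", "书"]        # "my book"
source_tags = ["r", "u", "n"]    # hypothetical source-standard tags
features = pos_conversion_features(words, source_tags, 1)
```

A classifier trained on data annotated in the target standard could then use such features to score candidate target tags for each token; the source tag acts as a strong but not decisive hint.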


Notes

  1. This treebank is not publicly available at present; a brief description of the corpus (in Chinese) is available at http://ccl.pku.edu.cn:8080/WebTreebank/WebTreebank_Readme.html.

  2. http://www.cipsc.org.cn/clp2010/task2_en.htm.

  3. http://nlp.cs.nyu.edu/evalb.

  4. DEC and DEG are POS tags for the Chinese particle ‘de’. DEC means that ‘de’ plays the role of a complementizer and DEG means ‘de’ is a genitive marker.

  5. http://www.cis.upenn.edu/~dbikel/software.html#comparator.


Acknowledgments

This work was supported in part by the National Science Foundation of China (61073140; 61272376; 61100089), Specialized Research Fund for the Doctoral Program of Higher Education (20100042110031), and the Fundamental Research Funds for the Central Universities (N110404012; N100204002).

Author information

Corresponding author

Correspondence to Jingbo Zhu.


About this article

Cite this article

Zhu, M., Zhu, J. & Wang, H. A feature-based approach to better automatic treebank conversion. Lang Resources & Evaluation 47, 1213–1231 (2013). https://doi.org/10.1007/s10579-013-9234-3


  • DOI: https://doi.org/10.1007/s10579-013-9234-3
