ABSTRACT
The performance of machine learning methods depends heavily on the volume of training data. To enlarge datasets, it is of interest to study the problem of unifying multiple labeled datasets that follow different annotation standards. In this paper, we focus on unifying datasets for sequence labeling problems, with natural language part-of-speech (POS) tagging as an exemplar application. To this end, we propose a probabilistic approach to transforming the annotations of one dataset into the standard specified by another. The key component of the approach, named label correspondence learning, serves as a bridge between the annotations of the two datasets. Two methods, designed from distinct perspectives, are proposed to attack this sub-problem. Experiments on two large-scale part-of-speech datasets demonstrate the efficacy of the transformation and label correspondence learning methods.
Index Terms
- Label correspondence learning for part-of-speech annotation transformation