Abstract
Korean part-of-speech (POS) tagging faces two major problems: a word-spacing unit does not map one-to-one to a POS tag, and morphemes must be recovered during tagging. This article proposes a novel two-step Korean POS tagger that addresses both problems. The tagger first uses an encoder-decoder architecture, trained on a POS-tagged corpus, to generate a sequence of lemmatized and recovered morphemes, each of which maps one-to-one to a POS tag. A standard sequence labeling method then determines the POS tag of each morpheme in the generated sequence. Because the encoder-decoder architecture extracts the knowledge for segmenting and recovering morphemes automatically from a POS-tagged corpus, the tagger is constructed without a dictionary or handcrafted linguistic rules. Experimental results on a standard dataset show that the proposed method outperforms existing POS taggers and achieves state-of-the-art performance.
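To make the two-step data flow concrete, the following is a minimal sketch and not the article's implementation. Both trained models are replaced by small hypothetical lookup tables: `TOY_MORPHEME_TABLE` stands in for the encoder-decoder and `TOY_TAG_TABLE` for the sequence labeler. The function names, the example sentence, and the Sejong-style tags are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the trained encoder-decoder (step 1).
# Each word-spacing unit maps to its lemmatized, recovered morphemes.
TOY_MORPHEME_TABLE: Dict[str, List[str]] = {
    "나는": ["나", "는"],
    "학교에": ["학교", "에"],
    "갔다": ["가", "았", "다"],  # the contracted "갔" is recovered as "가" + "았"
}

# Hypothetical stand-in for the trained sequence labeler (step 2).
# A real labeler uses context; this table ignores it.
TOY_TAG_TABLE: Dict[str, str] = {
    "나": "NP", "는": "JX", "학교": "NNG", "에": "JKB",
    "가": "VV", "았": "EP", "다": "EF",
}


def generate_morphemes(sentence: str) -> List[str]:
    """Step 1: turn word-spacing units into a flat morpheme sequence."""
    morphemes: List[str] = []
    for unit in sentence.split():
        # Unknown units are passed through unchanged in this toy version.
        morphemes.extend(TOY_MORPHEME_TABLE.get(unit, [unit]))
    return morphemes


def label_morphemes(morphemes: List[str]) -> List[Tuple[str, str]]:
    """Step 2: assign exactly one POS tag to each generated morpheme."""
    return [(m, TOY_TAG_TABLE.get(m, "UNK")) for m in morphemes]


def tag(sentence: str) -> List[Tuple[str, str]]:
    """Run both steps in order: generation, then labeling."""
    return label_morphemes(generate_morphemes(sentence))


if __name__ == "__main__":
    for morpheme, pos in tag("나는 학교에 갔다"):
        print(f"{morpheme}/{pos}")
```

The point of the split is visible in the last word-spacing unit: three units become seven morphemes, so step 2 can label one tag per item even though the input units do not align with tags.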
Index Terms
- Korean Part-of-speech Tagging Based on Morpheme Generation