short-paper

Korean Part-of-speech Tagging Based on Morpheme Generation

Published: 09 January 2020

Abstract

Korean part-of-speech (POS) tagging faces two major problems: a word-spacing unit does not map one-to-one to a POS tag, and morphemes must be recovered during tagging. This article proposes a novel two-step Korean POS tagger that addresses both problems. The tagger first uses an encoder-decoder architecture learned from a POS-tagged corpus to generate a sequence of lemmatized and recovered morphemes, each of which maps one-to-one to a POS tag. A standard sequence labeling method then determines the POS tag of each morpheme in the generated sequence. Because the encoder-decoder architecture extracts the knowledge for segmenting and recovering morphemes automatically from a POS-tagged corpus, the tagger is constructed without a dictionary or handcrafted linguistic rules. Experimental results on a standard dataset show that the proposed method outperforms existing POS taggers, achieving state-of-the-art performance.
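The two-step flow described in the abstract can be illustrated with a toy Python sketch. This is not the authors' implementation: the lookup tables `TOY_GENERATOR` and `TOY_TAGS`, the function names, and the `UNK` fallback tag are all hypothetical stand-ins for the trained encoder-decoder and the sequence labeler, and the tags follow the commonly used Sejong tagset as an assumption. The sketch only shows the data flow: word-spacing units go in, recovered morphemes come out of step one, and step two assigns exactly one tag to each morpheme.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the trained encoder-decoder (step 1):
# each word-spacing unit maps to its segmented and recovered morphemes.
# Note that "갔다" is recovered as "가" + "았" + "다", which is not a
# simple split of the surface string.
TOY_GENERATOR: Dict[str, List[str]] = {
    "나는": ["나", "는"],
    "학교에": ["학교", "에"],
    "갔다": ["가", "았", "다"],
}

# Hypothetical stand-in for the sequence labeler (step 2). A real
# labeler uses context; this table ignores it, so an ambiguous
# morpheme such as "가" always receives the same tag here.
TOY_TAGS: Dict[str, str] = {
    "나": "NP", "는": "JX", "학교": "NNG", "에": "JKB",
    "가": "VV", "았": "EP", "다": "EF",
}


def generate_morphemes(sentence: str) -> List[str]:
    """Step 1: turn word-spacing units into a flat morpheme sequence."""
    morphemes: List[str] = []
    for unit in sentence.split():
        # Unknown units are passed through unchanged in this toy version.
        morphemes.extend(TOY_GENERATOR.get(unit, [unit]))
    return morphemes


def label_morphemes(morphemes: List[str]) -> List[str]:
    """Step 2: assign exactly one POS tag to each morpheme."""
    return [TOY_TAGS.get(m, "UNK") for m in morphemes]


def tag(sentence: str) -> List[Tuple[str, str]]:
    """Run both steps and pair each morpheme with its tag."""
    morphemes = generate_morphemes(sentence)
    return list(zip(morphemes, label_morphemes(morphemes)))
```

Calling `tag("나는 학교에 갔다")` on this toy data pairs seven morphemes with seven tags, even though the input contains only three word-spacing units, which is the one-to-one mapping the first step is meant to provide.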

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 3
        May 2020
        228 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3378675

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 9 January 2020
        • Accepted: 1 November 2019
        • Revised: 1 July 2019
        • Received: 1 September 2017

        Qualifiers

        • short-paper
        • Research
        • Refereed
