Corpus based part-of-speech tagging

International Journal of Speech Technology

Abstract

In natural language processing, a part-of-speech (POS) tagger is a crucial subsystem in a wide range of applications: it labels (or classifies) unannotated words of natural language with POS labels corresponding to categories such as noun, verb or adjective. Mainstream approaches are generally corpus-based, meaning that a POS tagger learns from a corpus of pre-annotated data how to correctly tag unlabeled data. This article presents a brief account of the state of the art in POS tagging. POS tagging approaches use a labeled corpus to train computational models. Several typical models from three kinds of tagging are introduced: rule-based tagging, statistical approaches and evolutionary algorithms. The advantages and pitfalls of each kind of tagging are discussed and analyzed. Some rule-based and stochastic methods have achieved accuracies of 93–96 %, while some evolutionary algorithms reach about 96–97 %.
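
To make the corpus-based idea concrete, the following is a minimal sketch of a most-frequent-tag baseline, which is not one of the models surveyed in the article. It assumes a tiny hand-made corpus and a coarse tag set (DET, NOUN, VERB); the function names `train_unigram_tagger` and `tag` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn, from a pre-annotated corpus, the most frequent tag of each word."""
    counts = defaultdict(Counter)   # word -> Counter of tags seen with it
    tag_totals = Counter()          # tag -> total occurrences in the corpus
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            tag_totals[tag] += 1
    # Known words get their most frequent tag.
    lexicon = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    # Unknown words fall back to the most frequent tag overall.
    default_tag = tag_totals.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag unlabeled words using the learned lexicon."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


# Toy annotated corpus (an assumption for illustration, not real data).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]

lexicon, default_tag = train_unigram_tagger(corpus)
result = tag(["the", "cat", "barks"], lexicon, default_tag)
unknown = tag(["bird"], lexicon, default_tag)
```

A baseline of this kind ignores context entirely, so it cannot resolve a word that takes different tags in different sentences. The rule-based, statistical and evolutionary approaches named in the abstract all use information about neighbouring words or tags to handle such cases.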

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 61440018, 61501411) and the Hubei Natural Science Foundation (No. 2014CFB904).

Author information

Corresponding author

Correspondence to Yunliang Chen.

About this article

Cite this article

Lv, C., Liu, H., Dong, Y. et al. Corpus based part-of-speech tagging. Int J Speech Technol 19, 647–654 (2016). https://doi.org/10.1007/s10772-016-9356-2

  • DOI: https://doi.org/10.1007/s10772-016-9356-2
