Urdu part of speech tagging using conditional random fields

  • Original Paper
Language Resources and Evaluation

Abstract

Part-of-speech (POS) tagging, the assignment of syntactic categories to words in running text, is an important preliminary task in natural language processing applications such as speech processing and information extraction. Urdu POS tagging is challenging because many Urdu words take different POS tags in different contexts (morphosyntactic ambiguity). This paper addresses this challenge by developing a novel tagging approach using linear-chain conditional random fields (CRF). Our work is the first instance of a CRF approach for Urdu POS tagging. The proposed model employs a strong, stable and balanced feature set that combines language-independent and language-dependent features. The language-dependent features are the part-of-speech tag of the previous word and the suffix of the current word, while the language-independent features include the ‘context words window’. Our approach was evaluated against support vector machine techniques for Urdu POS tagging, considered the state of the art, on two benchmark datasets. The results show that our CRF approach improves on the F-measure of prior attempts by 8.3–8.5%.
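The feature set named in the abstract (previous-word tag, current-word suffix, context words window) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, feature keys, window size, suffix length, padding symbol, romanized toy sentence and tags are all assumptions chosen for illustration. The previous tag is taken from gold labels here, as it would be when preparing training data. At prediction time, a linear-chain CRF models this dependency through its tag-transition weights instead.

```python
from typing import Dict, List


def token_features(words: List[str], i: int, prev_tag: str = "<BOS>",
                   window: int = 2, suffix_len: int = 3) -> Dict[str, str]:
    """Feature dict for the token at position i (hypothetical feature names)."""
    word = words[i]
    feats = {
        "word": word,
        "suffix": word[-suffix_len:],   # suffix of the current word
        "prev_tag": prev_tag,           # POS tag of the previous word
    }
    # Context words window: neighbours on both sides, padded at sentence edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"word[{offset:+d}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    return feats


def sentence_features(words: List[str], tags: List[str]) -> List[Dict[str, str]]:
    """Features for every token, using gold tags for the previous-tag feature."""
    return [
        token_features(words, i, prev_tag=tags[i - 1] if i > 0 else "<BOS>")
        for i in range(len(words))
    ]


# Toy romanized sentence with made-up tags; not drawn from the paper's datasets.
toy_words = ["larka", "kitab", "parhta", "hai"]
toy_tags = ["NN", "NN", "VB", "AUX"]
for feats in sentence_features(toy_words, toy_tags):
    print(feats)
```

Feature dictionaries of this shape are the usual input to Python CRF toolkits. A template-based tool such as CRF++ (listed in the notes) instead expresses the same word, suffix and context-window features through a feature template file over columns of the training data.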

Notes

  1. http://www.cle.org.pk/software/langproc/POStagset.htm.

  2. http://182.180.102.251:8080/tag/.

  3. http://www.cle.org.pk/.

  4. https://github.com/alexandrekow/svmtutorial.

  5. http://crfpp.googlecode.com/svn/trunk/doc/index.html.

Author information

Corresponding author

Correspondence to Ali Daud.

About this article

Cite this article

Khan, W., Daud, A., Nasir, J.A. et al. Urdu part of speech tagging using conditional random fields. Lang Resources & Evaluation 53, 331–362 (2019). https://doi.org/10.1007/s10579-018-9439-6

  • DOI: https://doi.org/10.1007/s10579-018-9439-6
