Urdu part of speech tagging using conditional random fields

Khan, Wahab; Daud, Ali; Nasir, Jamal Abdul; Amjad, Tehmina; Arafat, Sachi; Aljohani, Naif; Alotaibi, Fahd S.

doi:10.1007/s10579-018-9439-6

Original Paper
Published: 01 December 2018

Language Resources and Evaluation, Volume 53, pages 331–362 (2019)

Wahab Khan¹,
Ali Daud¹,²,
Jamal Abdul Nasir¹,
Tehmina Amjad¹,
Sachi Arafat²,
Naif Aljohani² &
Fahd S. Alotaibi²

Abstract

Part-of-speech (POS) tagging, the assignment of syntactic categories to words in running text, is an important preliminary task in natural language processing applications such as speech processing and information extraction. Urdu language processing presents a challenge because many Urdu words take different POS tags in different contexts (morphosyntactic ambiguity). This paper addresses this challenge by developing a novel tagging approach using linear-chain conditional random fields (CRF). Our work is the first instance of a CRF approach for Urdu POS tagging. The proposed model employs a strong, stable and balanced feature set that combines language-independent and language-dependent features. The language-dependent features are the part-of-speech tag of the previous word and the suffix of the current word, while the language-independent features include the ‘context words window’. Our approach was evaluated on two benchmark datasets against support vector machine techniques for Urdu POS tagging, which are considered the state of the art. The results show that our CRF approach improves upon the F-measure of prior attempts by 8.3–8.5%.
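The three feature types named in the abstract (previous-word POS tag, current-word suffix, and context words window) can be illustrated with a small per-token feature extractor of the kind usually fed to a linear-chain CRF toolkit. This is only a sketch under stated assumptions: the window size, the maximum suffix length, the feature names, the boundary markers, the romanized example sentence and its tags are all hypothetical choices made for illustration. The preview text does not give the paper's actual settings or tagset.

```python
from typing import Dict, List


def token_features(words: List[str], i: int, prev_tag: str,
                   window: int = 2, max_suffix: int = 3) -> Dict[str, str]:
    """Build a feature dict for the token at position i.

    window and max_suffix are illustrative defaults, not values from the paper.
    """
    word = words[i]
    # Language-dependent feature 1: POS tag of the previous word.
    feats = {"bias": "1", "prev_tag": prev_tag}

    # Language-dependent feature 2: suffixes of the current word (1..max_suffix chars).
    for n in range(1, min(max_suffix, len(word)) + 1):
        feats[f"suffix{n}"] = word[-n:]

    # Language-independent feature: context words window around position i.
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0:
            feats[f"w[{offset}]"] = "<BOS>"   # before sentence start
        elif j >= len(words):
            feats[f"w[{offset}]"] = "<EOS>"   # after sentence end
        else:
            feats[f"w[{offset}]"] = words[j]
    return feats


def sentence_features(words: List[str], tags: List[str]) -> List[Dict[str, str]]:
    """Feature dicts for a whole training sentence, using gold previous tags."""
    return [
        token_features(words, i, tags[i - 1] if i > 0 else "<START>")
        for i in range(len(words))
    ]


if __name__ == "__main__":
    # Romanized toy sentence with placeholder tags (hypothetical, not the paper's tagset).
    words = ["yeh", "kitab", "achi", "hai"]
    tags = ["PRON", "NOUN", "ADJ", "VERB"]
    for w, f in zip(words, sentence_features(words, tags)):
        print(w, f)
```

Feature dicts of this shape are a common input format for CRF libraries. In many toolkits the dependency on the previous tag is modelled by the CRF's transition weights rather than passed as an explicit feature, so the `prev_tag` entry here is mainly a way of making that dependency visible.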

Author information

Authors and Affiliations

Department of Computer Science and Software Engineering, IIU, Islamabad, 44000, Pakistan
Wahab Khan, Ali Daud, Jamal Abdul Nasir & Tehmina Amjad
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
Ali Daud, Sachi Arafat, Naif Aljohani & Fahd S. Alotaibi

Corresponding author

Correspondence to Ali Daud.

About this article

Cite this article

Khan, W., Daud, A., Nasir, J.A. et al. Urdu part of speech tagging using conditional random fields. Lang Resources & Evaluation 53, 331–362 (2019). https://doi.org/10.1007/s10579-018-9439-6

Published: 01 December 2018
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s10579-018-9439-6
