ABSTRACT
Panjabi (also referred to as Punjabi) is the name given to a collection of tonal languages originating in the Punjab region of South Asia. It is the ninth most spoken language in the world, spoken by roughly 1.9% of the world's population. Panjabi is written in two scripts: Gurmukhi and Shahmukhi. Yet it can be considered a "low resource language" due to the lack of the basic building blocks of Natural Language Processing (NLP) research. Toshakhana is our attempt to build the first Panjabi corpus in Gurmukhi script with a temporal component.
Index Terms
- Toshakhana: A Multidimensional Panjabi Corpus in Gurmukhi Script