short-paper

Open Access

Toshakhana: A Multidimensional Panjabi Corpus in Gurmukhi Script

Authors:
Arvinder Kang, University of Mississippi, Oxford, Mississippi, USA (ORCID: 0009-0003-9222-8458)
Thai Le, University of Mississippi, Oxford, Mississippi, USA (ORCID: 0000-0001-9632-6870)
Yixin Chen, University of Mississippi, Oxford, Mississippi, USA (ORCID: 0000-0001-7645-674X)

ACM SE '24: Proceedings of the 2024 ACM Southeast Conference, April 2024, Pages 278–283. https://doi.org/10.1145/3603287.3651197

Published: 27 April 2024

ABSTRACT

Panjabi (also referred to as Punjabi) is the name given to a collection of tonal languages originating in the Punjab region of South Asia. It is the ninth most spoken language in the world, spoken by roughly 1.9% of the world population. Panjabi is written in two scripts, Gurmukhi and Shahmukhi. Yet it can be considered a "low resource language" because it lacks the basic building blocks of Natural Language Processing (NLP) research. Toshakhana is our attempt to build the first Panjabi corpus in Gurmukhi script with a temporal component.

Published in
ACM SE '24: Proceedings of the 2024 ACM Southeast Conference
April 2024
337 pages
ISBN:9798400702372
DOI:10.1145/3603287
Organizing Chair: Dan Lo
Program Chair: Eric Gamess
Copyright © 2024 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
corpus
datasets
low resource languages
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
ACM SE '24 Paper Acceptance Rate: 44 of 137 submissions, 32%
Overall Acceptance Rate: 178 of 377 submissions, 47%