DOI: 10.1145/3603287.3651197
short-paper
Open Access

Toshakhana: A Multidimensional Panjabi Corpus in Gurmukhi Script

Published: 27 April 2024

ABSTRACT

Panjabi (also referred to as Punjabi) is the name given to a collection of tonal languages originating in the Punjab region of South Asia. It is the ninth most spoken language in the world, spoken by roughly 1.9% of the world population. Panjabi is written in two scripts, Gurmukhi and Shahmukhi. Despite its number of speakers, it can still be considered a "low-resource language" because it lacks the basic building blocks of Natural Language Processing (NLP) research. Toshakhana is our attempt to build the first Panjabi corpus in Gurmukhi script with a temporal component.

Published in

ACM SE '24: Proceedings of the 2024 ACM Southeast Conference
April 2024
337 pages
ISBN: 9798400702372
DOI: 10.1145/3603287

Copyright © 2024 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Qualifiers

• short-paper
• Research
• Refereed limited

Acceptance Rates

ACM SE '24 Paper Acceptance Rate: 44 of 137 submissions, 32%
Overall Acceptance Rate: 178 of 377 submissions, 47%