Konkani WordNet: Corpus-Based Enhancement using Crowdsourcing

Published: 04 March 2022

Abstract

Konkani is one of the languages included in the Eighth Schedule of the Indian Constitution. It is the official language of Goa and is spoken mainly in Goa and in parts of Karnataka and Kerala. Konkani WordNet, also referred to as Konkani Shabdamalem (kōṁkanī śabdamālēṁ), was developed under the Indradhanush WordNet Project Consortium between August 2010 and October 2013. The project was funded by Technology Development for Indian Languages (TDIL), Department of Electronics & Information Technology (DeitY), and Ministry of Communication and Information Technology (MCIT). Work on Konkani WordNet has been halted since the project ended. Currently, the Konkani WordNet contains around 32,370 synsets. To make it a powerful resource for NLP applications in Konkani, research is needed on enhancing the Konkani WordNet through community involvement. Crowdsourcing is a technique in which the knowledge of the crowd is used to accomplish a particular task.
In this article, we present the details of a crowdsourcing platform named “Konkani Shabdarth” (kōṁkanī śabdārth). Konkani Shabdarth attempts to use the knowledge of Konkani-speaking people to create new synsets and thereby enhance the wordnet quantitatively. It also aims to improve the overall quality of the Konkani WordNet by validating existing synsets and adding missing words to them. A text corpus named “Konkani Shabdarth Corpus” was created from Konkani literature while implementing the Konkani Shabdarth tool. Using this corpus, 572 root words missing from the Konkani WordNet were identified and given as input to Konkani Shabdarth. So far, a total of 94 users have registered on the platform, of whom 25 have actually played the game. Currently, 71 new synsets have been obtained for 21 words. For some words, multiple entries for the concept definition have been received. This overlap is essential for automating the validation of synsets. Because of the pandemic, it has been difficult to train players and get them to actually play the game and contribute. We also studied the impact of adding missing words from other existing Konkani text corpora on the coverage of Konkani WordNet. The expected increase in the percentage coverage of Konkani WordNet is in the range 20–27 after adding the missing words from the Konkani Shabdarth Corpus, compared with an increase in the range 1–10 for the other corpora.
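
The abstract does not state how coverage is computed. As a rough illustration only, the sketch below assumes coverage means the percentage of distinct corpus words that have an entry in the wordnet lexicon. The function names `coverage` and `missing_words` and the toy word lists are hypothetical and are not taken from the Konkani Shabdarth tool or its corpus.

```python
def coverage(corpus_words, wordnet_words):
    """Percentage of distinct corpus words that appear in the wordnet lexicon.

    Assumed definition for illustration; the article may define coverage differently.
    """
    corpus_types = set(corpus_words)
    if not corpus_types:
        return 0.0
    return 100.0 * len(corpus_types & set(wordnet_words)) / len(corpus_types)


def missing_words(corpus_words, wordnet_words):
    """Distinct corpus words with no entry in the wordnet lexicon, sorted."""
    return sorted(set(corpus_words) - set(wordnet_words))


# Toy placeholder data, not drawn from any Konkani corpus.
corpus = ["w1", "w2", "w3", "w4", "w2", "w5"]
wordnet = {"w1", "w2", "w9"}

before = coverage(corpus, wordnet)
gaps = missing_words(corpus, wordnet)
# Simulate adding every missing word to the lexicon, then re-measure.
after = coverage(corpus, wordnet | set(gaps))

print(f"missing words: {gaps}")
print(f"coverage before: {before:.1f}%, after: {after:.1f}%")
```

Under this assumed definition, the increase reported for a corpus would be the difference between the two measurements, taken before and after the missing words are added.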


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 4
    July 2022
    464 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3511099

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 March 2022
    Accepted: 01 November 2021
    Revised: 01 October 2021
    Received: 01 September 2020
    Published in TALLIP Volume 21, Issue 4

    Author Tags

    1. WordNet
    2. Konkani Wordnet
    3. Konkani Shabdarth
    4. crowdsourcing

    Qualifiers

    • Research-article
    • Refereed
