Incorporating Supervised Learning Algorithms with NLP Techniques to Classify Bengali Language Forms

Authors:
Abdul Bari Parves
Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh

Abdullah Al Imran
Computer Science and Engineering, American International University-Bangladesh, Dhaka, Bangladesh

Md. Riazur Rahman
Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh

ICCA 2020: Proceedings of the International Conference on Computing Advancements, January 2020, Article No.: 62, Pages 1–7
https://doi.org/10.1145/3377049.3377110

Published: 20 March 2020

ABSTRACT

Every language has its own roots, forms, and grammar, and Bengali is no exception. Bengali has two core forms, "Sadhu-bhasha" and "Cholito-bhasha", both of which have been widely used in everything from everyday communication to literary publications. At present, Sadhu-bhasha is found only in old books and literary publications, whereas Cholito-bhasha is used almost everywhere. Nevertheless, many Bengali linguists still study these two forms to preserve the language's roots, to understand and develop Bengali, and to extract knowledge from historical publications, which were mainly written in Sadhu-bhasha. Unfortunately, they do not yet have any digital tool that can assist their research by automatically identifying these core forms in the large archive of Bengali literature. This study aims to build such an automatic, intelligent system that can accurately identify the two language forms by harnessing Natural Language Processing (NLP). We apply advanced NLP techniques and six supervised learning algorithms to classify text corpora as "Sadhu-bhasha" or "Cholito-bhasha". All six models yielded very promising results; Multinomial Naive Bayes outperformed the others with 99.5% accuracy, 99.0% precision, 100% recall, an AUC score of 0.995, and an F1 score of 0.995. Additionally, the study performs a qualitative analysis using the t-SNE algorithm to visualize the difference between Sadhu-bhasha and Cholito-bhasha.
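
The abstract does not describe the features, preprocessing, or data used, so the following is only a minimal sketch of the general technique it names: a Multinomial Naive Bayes text classifier, here written from scratch over whitespace tokens with Laplace smoothing. The class `TinyMultinomialNB`, the helper `f1_from`, and the six toy training sentences are all hypothetical and were invented for illustration; they are not the authors' corpus or implementation.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)          # documents per class
        self.token_counts = defaultdict(Counter)   # token counts per class
        for text, label in zip(texts, labels):
            self.token_counts[label].update(text.split())
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(doc_count / n_docs)   # log prior
            for tok in text.split():
                if tok in self.vocab:              # ignore unseen tokens
                    score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy sentences (hypothetical): Sadhu-bhasha uses longer verb and pronoun
# forms such as "করিতেছি" and "তাহারা"; Cholito-bhasha shortens them.
train_texts = [
    "আমি কাজ করিতেছি", "তাহারা স্কুলে যাইতেছে", "সে চিঠি লিখিতেছে",
    "আমি কাজ করছি", "তারা স্কুলে যাচ্ছে", "সে চিঠি লিখছে",
]
train_labels = ["sadhu"] * 3 + ["cholito"] * 3

model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("তাহারা চিঠি লিখিতেছে"))  # sadhu
print(model.predict("তারা চিঠি লিখছে"))       # cholito
print(round(f1_from(0.99, 1.0), 3))           # 0.995
```

The last line checks the reported metrics against each other: with 99.0% precision and 100% recall, the F1 score is 2 · 0.99 · 1.0 / 1.99 ≈ 0.995, which agrees with the value stated in the abstract.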

Index Terms

Computing methodologies → Machine learning → Learning paradigms → Supervised learning → Supervised learning by classification

Published in

ICCA 2020: Proceedings of the International Conference on Computing Advancements
January 2020
517 pages
ISBN:9781450377782
DOI:10.1145/3377049

Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Bengali Text Classification
Natural Language Processing
Sadhu-bhasha and Cholito-bhasha Classification
Supervised Learning Algorithms
Qualifiers
- research-article
- Research
- Refereed limited