ABSTRACT
Every language has its own roots, form, and grammar, and Bengali is no exception. The Bengali language has two core forms, "Sadhu-bhasha" and "Cholito-bhasha", which have been widely used in everything from everyday communication to literary publications. At present, Sadhu-bhasha is found only in old books and literary publications, whereas Cholito-bhasha is used almost everywhere. However, many Bengali linguists still study these two forms to preserve the language's roots, to understand and develop Bengali, and to extract knowledge from historical publications, which were written mainly in Sadhu-bhasha. Unfortunately, they still have no digital tool that can assist their research by automatically identifying these core forms in the large archive of Bengali literature. This study aims to build such an automatic, intelligent system that accurately identifies the two language forms using Natural Language Processing (NLP). We applied advanced NLP techniques and six supervised learning algorithms to classify "Sadhu-bhasha" and "Cholito-bhasha" in text corpora. All six models yielded very promising results; Multinomial Naive Bayes outperformed the others, with 99.5% accuracy, 99.0% precision, 100% recall, an AUC score of 0.995, and an F1 score of 0.995. Additionally, the study performs a qualitative analysis using the t-SNE algorithm to visualize the difference between Sadhu-bhasha and Cholito-bhasha.
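To make the classification step concrete, the following is a minimal sketch of a Multinomial Naive Bayes text classifier written in plain Python. It is not the authors' pipeline: the abstract does not specify the features used, so the character-trigram features, the Laplace smoothing value, the `char_ngrams` helper, and the six romanized toy sentences are all assumptions chosen for illustration. The sentences contrast the longer Sadhu verb forms (for example "koritechhe") with their contracted Cholito counterparts ("korchhe").

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split a sentence into overlapping character n-grams (assumed feature set)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class MultinomialNB:
    """Multinomial Naive Bayes over n-gram counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed value)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams unseen in training are ignored
                    score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical romanized toy data, not the corpus used in the study.
train = [
    ("ami bhat khaitechhi", "sadhu"),
    ("tahara bajare jaitechhe", "sadhu"),
    ("se boi poritechhe", "sadhu"),
    ("ami bhat khachchhi", "cholito"),
    ("tara bajare jachchhe", "cholito"),
    ("se boi porchhe", "cholito"),
]
texts, labels = zip(*train)
model = MultinomialNB().fit(texts, labels)

for sentence in ["tahara kaj koritechhe", "tara kaj korchhe"]:
    print(sentence, "->", model.predict(sentence))
```

Character n-grams are used here because the two forms differ largely in verb endings and pronoun shapes, which short substrings can capture even for words never seen in training. A real system would work on Bengali script and a far larger labelled corpus.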
Index Terms
- Incorporating Supervised Learning Algorithms with NLP Techniques to Classify Bengali Language Forms