KerMinSVM for imbalanced datasets with a case study on Arabic comics classification

doi:10.1016/j.engappai.2017.01.001

Engineering Applications of Artificial Intelligence

Volume 59, March 2017, Pages 159-169

Abstract

Many studies have classified large text documents using classifiers that range from simple distance-based methods such as K-Nearest-Neighbor (KNN) to more advanced methods such as Support Vector Machines (SVM). These traditional approaches fail on short texts because the limited number of words produces sparse feature vectors. Another common problem in text classification is class imbalance (CI), which occurs when one class contains most of the samples while the other contains only a few. Standard classifiers applied to imbalanced data achieve high accuracy on the majority class and low accuracy on the minority class. Motivated by the need to classify the content of Arabic comics, we propose KerMinSVM, a kernel extension of our previously proposed MinSVM, coupled with a new feature dimensionality reduction scheme based on word root frequency ratios (WRFR). KerMinSVM was tested on multiple imbalanced benchmark datasets, and the results were verified using three measures: accuracy, F-measure, and statistical analysis. WRFR was applied to a manually constructed Arabic comic text dataset to detect strong content in children's comic books. Test results revealed that the proposed framework outperformed most of the other methods for imbalanced datasets and short text classification.

Introduction

Due to the rapid development and spread of the internet, different types of short texts have been produced, such as web search snippets, chat messages, comments, status updates, tweets, news feeds, books, movie synopses and reviews. Classifying short text is of great importance for several purposes and applications, such as filtering offensive comments or assessing the satisfaction of customers with a certain product. Another example of short text is found in comic books. This text is usually unstructured and takes the form of brief conversations, consisting of multiple short sentences. Since short texts tend to have a sparse feature vector and exhibit class imbalance (CI), they cannot be classified with good accuracy using standard techniques.

An imbalanced dataset is one in which the classification categories are not equally represented. A class that comprises many samples is referred to as the ‘majority class’, and a class that contains very few samples is known as the ‘minority class’. As stated before, a classifier trained on an imbalanced dataset tends to achieve high accuracy on the majority class but low accuracy on the minority class. This is because most classification algorithms maximize overall accuracy without taking the accuracy of each class into consideration. In imbalanced datasets, the impact of the minority samples is more pronounced than that of the majority samples. Misclassifying these minority samples inevitably yields misleading and inaccurate information and hence undermines the aims of the application (Awad and Khanna, 2015).
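The effect described above can be illustrated with a minimal sketch. It assumes a hypothetical dataset of 95 majority and 5 minority samples and a degenerate classifier that always predicts the majority class; the labels, counts, and function names are illustrative assumptions and are not taken from this study.

```python
def per_class_recall(y_true, y_pred, label):
    """Fraction of samples of class `label` that were predicted correctly."""
    total = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / total if total else 0.0


def f_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the `positive` class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced data: 95 majority (0) and 5 minority (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
majority_recall = per_class_recall(y_true, y_pred, 0)
minority_recall = per_class_recall(y_true, y_pred, 1)
minority_f = f_measure(y_true, y_pred, 1)

print(overall_accuracy, majority_recall, minority_recall, minority_f)
```

In this toy case the overall accuracy is high although no minority sample is detected, which is why per-class measures such as the F-measure are useful alongside accuracy.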

Comics are usually popular amongst children. However, a number of these comics include strong content, such as conflicts, war, weaponry, and martyrdom, which are topics unsuitable for a younger audience. In general, the number of comics that contain strong material is very small compared with the number that are suitable for children. Nevertheless, detecting strong content in comic books is crucial. To this end, we propose a new framework to classify short Arabic texts. Taking into consideration the root-based nature of the Arabic language, we reduce the sparsity and dimensionality of the feature vector without adding external information to the original data. This methodology uses the roots of words as features: it groups words that share a root into one feature, and consequently reduces data dimensionality. To reduce the feature vector length even further, roots are grouped together based on their semantic similarities. To test the feature reduction technique, a dataset of Arabic comic text was manually constructed and annotated. Grouping similar roots together gave a better representation of the constructed dataset and reduced the sparsity of the feature vector. To improve the classification accuracy of Support Vector Machines (SVM) on imbalanced data, a kernel extension to the Minority Support Vector Machine (MinSVM) classifier that we proposed earlier (Ajeeb et al., 2013) has been developed and tested.
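The idea of grouping words by root can be sketched as follows. This is only an illustration under stated assumptions: the word-to-root table is a small hand-written, hypothetical lookup (a real system would rely on an Arabic morphological analyzer), the function names are invented for this example, and the code shows plain root counting, not the WRFR ratio itself, whose definition is not given in this excerpt.

```python
from collections import Counter

# Hypothetical word-to-root lookup, for illustration only.
# First three words (book, writer, library) share the root k-t-b;
# last three (war, wars, warrior) share the root h-r-b.
WORD_TO_ROOT = {
    "كتاب": "كتب",
    "كاتب": "كتب",
    "مكتبة": "كتب",
    "حرب": "حرب",
    "حروب": "حرب",
    "محارب": "حرب",
}


def root_counts(tokens):
    """Count occurrences per root; unknown words fall back to themselves."""
    return Counter(WORD_TO_ROOT.get(tok, tok) for tok in tokens)


def to_vector(counts, vocabulary):
    """Turn a Counter into a feature vector with a fixed term order."""
    return [counts.get(term, 0) for term in vocabulary]


tokens = ["كتاب", "كاتب", "حرب", "حروب", "محارب", "مكتبة"]

word_vocab = sorted(set(tokens))
root_vocab = sorted(set(WORD_TO_ROOT.get(t, t) for t in tokens))

word_vector = to_vector(Counter(tokens), word_vocab)
root_vector = to_vector(root_counts(tokens), root_vocab)

print(len(word_vector), word_vector)
print(len(root_vector), root_vector)
```

With word features, each of the six tokens occupies its own dimension; with root features, the same text is described by two denser dimensions.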

The remainder of this paper is organized as follows. Section II reviews the literature on techniques for improving imbalanced data classification, as well as previous work on Arabic text and short text classification. Section III presents the proposed framework, which contains KerMinSVM for imbalanced data classification and WRFR, the new feature extraction approach for short Arabic text. Experimental results are presented in Section IV, followed by concluding remarks in Section V.

Data imbalance

Relatively little SVM research has addressed the classification of imbalanced datasets. One approach is to resample the dataset to achieve class balance, either by under-sampling the majority class or by over-sampling the minority class. A second approach is to modify the SVM algorithm itself to overcome the data imbalance. Finally, hybrid methods are designed to benefit from the advantages of both approaches.
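The two resampling strategies can be sketched in their simplest random form. This is a minimal illustration, assuming hypothetical one-dimensional samples and invented function names; it shows random under-sampling and random over-sampling by duplication, not SMOTE or any method evaluated in this paper. The imbalance ratio is taken here as the majority count divided by the minority count, which is an assumed convention.

```python
import random


def random_undersample(majority, minority, rng):
    """Keep a random majority subset equal in size to the minority class."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """Duplicate random minority samples until both classes are equal in size.

    Assumes the majority class is at least as large as the minority class.
    """
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical one-dimensional samples: 20 majority, 4 minority.
majority = [float(i) for i in range(20)]
minority = [100.0, 101.0, 102.0, 103.0]

# Assumed convention: majority count divided by minority count.
imbalance_ratio = len(majority) / len(minority)

under_maj, under_min = random_undersample(majority, minority, rng)
over_maj, over_min = random_oversample(majority, minority, rng)

print(imbalance_ratio, len(under_maj), len(under_min), len(over_maj), len(over_min))
```

Under-sampling discards majority samples and so may lose information, while over-sampling by duplication adds no new information and may encourage overfitting; this trade-off is one motivation for algorithm-level and hybrid methods.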

Proposed framework

The proposed framework consists of a text processing stage and a classifier. Fig. 2 shows the workflow of the methodology. First, the text must be extracted from the comic books. The data used in this study were in PDF format, and because no tools were available to extract the text automatically, this step was performed manually. The extracted text was saved in UTF-8 text files. The next step was text processing, where the raw text files are converted to feature vector

KerMinSVM benchmark testing

This section evaluates the performance of KerMinSVM and compares it with the original SVM implementation, an SVM with different cost functions (CSVM), an SVM after applying SMOTE (SMOTE-SVM), and an SVM after applying RUS to the data (RUS-SVM). Non-SVM approaches such as KNN and Tree-Fitting were also evaluated to highlight the distinction between different types of classification methods. For these tests, we chose 16 datasets with different Imbalance Ratios (IR),

Conclusion

In this paper, we introduced the KerMinSVM classifier, a modification of the original SVM designed to solve the problem of learning from imbalanced datasets. As shown in the experimental section, KerMinSVM outperformed other techniques for learning from an imbalanced dataset. KerMinSVM has a higher sensitivity and F-measure than the standard SVM and the other techniques, and it does not sacrifice specificity. Moreover, KerMinSVM is computationally efficient, since it does not

Acknowledgments

This work is partly supported by the Qatar National Research Foundation (QNRF) and partly by the University Research Board at the American University of Beirut. The authors would like to thank Yara Rizk, Wissam Marrouche and Mohamad Kamareddine for their help in formatting the paper.

References (44)

Sun, Y., 2007. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit.
Abdulla, N., et al., 2013. Arabic sentiment analysis: Corpus-based and lexicon-based. Proceedings of The IEEE conference...
Ajeeb, N., Nayal, A., Awad, M., 2013. Minority SVM for linearly separable imbalanced datasets. In: Proceedings of the...
Akbani, R., et al., 2004. Applying support vector machines to imbalanced datasets. Mach. Learn. ECML 2004.
Alcalá, J., 2010. Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Value. Log. Soft Comput. 17 (2-3).
Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Al-Rajeh, M., Khorsheed, A., 2008. Automatic Arabic text...
Awad, M., Khanna, R., 2015. Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System...
Batuwita, R., Palade, V., 2010. FSVM-CIL: fuzzy support vector machines for class imbalance learning. In: IEEE...
Bollegala, D., Matsuo, Y., Ishizuka, M., 2007. Measuring semantic similarity between words using web search engines,...
Charalampopoulos, I., Anagnostopoulos, I., 2011. A comparable study employing weka clustering/classification algorithms...
Chawla, N.V., et al., 2011. SMOTE: synthetic minority over-sampling technique. arXiv Prepr. arXiv:1106.1813.
Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W., 2003. SMOTEBoost: Improving prediction of the minority class in...
Chen, M., Jin, X., Shen, D., 2011. Short text classification improved by learning multi-granularity topics. In:...
Cieslak, D.A., 2012. Hellinger distance decision trees are robust and skew-insensitive. Data Min. Knowl. Discov.
El-Halees, A., 2007. Arabic text classification using maximum entropy. Islam. Univ. J. (Ser. Nat. Stud. Eng.).
El Kourdi, M., Bensaid, A., Rachidi, T., 2004. Automatic Arabic document categorization based on the Naive Bayes...
Faguo, Z., Fan, Z., Bingru, Y., Xingang, Y., 2010. Research on short text classification algorithm based on statistics...
Gazzah, S., Amara, N., 2008. New oversampling approaches based on polynomial fitting for imbalanced data sets. In:...
Holmes, G., Donkin, A., Witten, I., 1994. Weka: A machine learning workbench. Intelligent Information Systems, 1994....
Hu, X., Sun, N., Zhang, C., Chua, T., 2009. Exploiting internal and external semantics for the clustering of short...
Imam, T., et al., 2006. z-SVM: an SVM for improved classification of imbalanced data. AI 2006: Adv. Artif. Intell.
Jomaa*, H., Kamereddine*, M., Nayal*, A., Rizk*, Y., Awad, M., 2016. Affective Relationship Between Color and Text in...

