KerMinSVM for imbalanced datasets with a case study on Arabic comics classification

https://doi.org/10.1016/j.engappai.2017.01.001

Abstract

Many studies have been performed to classify large-sized text documents using different classifiers, ranging from simple distance classifiers such as K-Nearest-Neighbor (KNN) to more advanced classifiers such as Support Vector Machines (SVM). Traditional approaches fail when a short text is encountered, due to the sparsity that results from a limited number of words. Another common problem in text classification is class imbalance (CI). CI occurs when one class of the data contains most of the samples while the other class contains only a few. Standard classifiers, when applied to imbalanced data, achieve high accuracy for the majority class and low accuracy for the minority one. Motivated by the need to classify the content of Arabic comics, we propose a novel framework built on KerMinSVM, a kernel extension of our previously proposed MinSVM, coupled with a new feature dimensionality reduction scheme based on word root frequency ratios (WRFR). KerMinSVM was tested on multiple imbalanced benchmark datasets, and the results were verified using three measures: accuracy, F-measure, and statistical analysis. WRFR was applied to a manually constructed Arabic comic text dataset to detect strong content in children's comic books. Test results revealed that our proposed framework outperformed most of the methods for imbalanced datasets and short text classification.

Introduction

Due to the rapid development and spread of the internet, different types of short texts have been produced, such as web search snippets, chat messages, comments, status updates, tweets, news feeds, books, movie synopses and reviews. Classifying short text is of great importance for several purposes and applications, such as filtering offensive comments or assessing the satisfaction of customers with a certain product. Another example of short text is found in comic books. This text is usually unstructured and takes the form of brief conversations, consisting of multiple short sentences. Since short texts tend to have a sparse feature vector and exhibit class imbalance (CI), they cannot be classified with good accuracy using standard techniques.

An imbalanced dataset is one in which the classification categories are not equally represented. A class that comprises many samples is referred to as the ‘majority class’, and a class that contains very few samples is known as the ‘minority class’. As stated above, a classifier trained on an imbalanced dataset tends to achieve high accuracy for the majority class but low accuracy for the minority class. This is because most classification algorithms focus on maximizing overall accuracy without taking the accuracy of each class into consideration. In imbalanced datasets, the impact of the minority samples is more pronounced than that of the majority samples. Misclassifying these minority samples will inevitably result in misleading and inaccurate information and hence undermine the aims of the application (Awad and Khanna, 2015).
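The gap between overall accuracy and per-class accuracy can be illustrated with a small sketch. It is an illustration only and is not taken from the paper: it assumes a hypothetical test set of 95 majority and 5 minority samples and a degenerate classifier that always predicts the majority class. The helper name `per_class_recall` is our own.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical imbalanced test set: 95 majority (0) and 5 minority (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

print(overall_accuracy)  # 0.95
print(recall)            # {0: 1.0, 1: 0.0}
```

An overall accuracy of 95% hides the fact that every minority sample is misclassified, which is why per-class measures are needed alongside accuracy.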

Comics are usually popular amongst children. However, a number of these comics include strong content, such as conflicts, war, weaponry, and martyrdom, which are topics unsuitable for a younger audience. In general, the number of comics that contain strong material is very small compared with the number of comics that are suitable for children. Nevertheless, detecting strong content in comic books is crucial. To this end, we propose a new framework to classify short Arabic texts. Taking into consideration the root-based nature of the Arabic language, we reduce the sparsity and dimensionality of the feature vector without adding external information to the original data. This methodology allows the roots of words to be used as features. It groups words of the same root into one feature, and consequently reduces data dimensionality. To reduce the feature vector length even further, roots are grouped together based on their semantic similarities. To test the feature reduction technique, a dataset of Arabic comic text was manually constructed and annotated. Grouping similar roots together gave a better representation of the constructed dataset and reduced the sparsity of the feature vector. To improve the classification accuracy of Support Vector Machines (SVM) on imbalanced data, a kernel extension to the Minority Support Vector Machine (MinSVM) classifier we proposed earlier (Ajeeb et al., 2013) has been developed and tested.
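The idea of collapsing words that share a root into a single feature can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's WRFR scheme: the lookup table `TOY_ROOTS` and the function `root_features` are hypothetical, and a real system would obtain roots from an Arabic morphological analyzer or stemmer rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical word-to-root lookup, for illustration only.
# Root k-t-b: book, writer, library. Root h-r-b: war, warrior, wars.
TOY_ROOTS = {
    "كتاب": "كتب",
    "كاتب": "كتب",
    "مكتبة": "كتب",
    "حرب": "حرب",
    "محارب": "حرب",
    "حروب": "حرب",
}

def root_features(tokens, roots=TOY_ROOTS):
    """Count tokens by root; words without a known root keep their surface form."""
    return Counter(roots.get(tok, tok) for tok in tokens)

tokens = ["كتاب", "كاتب", "حرب", "محارب", "حروب", "مكتبة"]
word_features = Counter(tokens)   # one feature per distinct word
grouped = root_features(tokens)   # one feature per distinct root

print(len(word_features), len(grouped))  # 6 2
```

In this toy example six word-level features collapse into two root-level features, each with a count of three, so the vector is both shorter and less sparse.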

The remainder of this paper is organized as follows. Section II reviews the literature on techniques used to improve imbalanced data classification, as well as previous work on Arabic text and short text classification. Section III presents the proposed framework, which contains KerMinSVM for imbalanced data classification and WRFR, the new feature extraction approach for short Arabic text. Experimental results are presented in Section IV, followed by concluding remarks in Section V.

Section snippets

Data imbalance

Little SVM research has investigated improving classification for imbalanced datasets. One approach is to resample the dataset to achieve class balance, either by under-sampling the majority class or by over-sampling the minority class. Another technique is to modify the SVM algorithm itself to overcome the data imbalance. Finally, hybrid methods are designed to benefit from the advantages of both of these approaches.
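The two resampling strategies can be sketched in plain Python. This is a minimal illustration, not the procedure evaluated in the paper: the function names are our own, samples are chosen uniformly at random, and the over-sampler simply duplicates existing minority samples, whereas a method such as SMOTE synthesizes new ones instead.

```python
import random
from collections import Counter

def _group_by_class(X, y):
    """Map each label to the list of its feature vectors."""
    groups = {}
    for features, label in zip(X, y):
        groups.setdefault(label, []).append(features)
    return groups

def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for features in rng.sample(rows, target):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out

def random_oversample(X, y, seed=0):
    """Randomly duplicate samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy data: 8 majority (0) and 2 minority (1) samples.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(sorted(Counter(y_under).items()))  # [(0, 2), (1, 2)]
print(sorted(Counter(y_over).items()))   # [(0, 8), (1, 8)]
```

Under-sampling discards majority information, while over-sampling by duplication adds no new information and can encourage overfitting; these trade-offs are one reason algorithm-level and hybrid methods are also studied.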

Proposed framework

The proposed framework consists of a text processing stage and a classifier. Fig. 2 shows the workflow of the methodology. First, it is necessary to extract the text from the comic books. The data used in this study was in PDF format, and because there were no available tools to extract the texts automatically, this step was performed manually. The extracted text was saved in UTF-8 text files. The next step was the text processing, where the raw text files are converted to feature vector

KerMinSVM benchmark testing

This section evaluates the performance of KerMinSVM and compares it with the performances of the original SVM implementation, an SVM with different cost functions (CSVM), an SVM after applying SMOTE (SMOTE-SVM) and finally with an SVM after applying RUS on the data (RUS-SVM). Non-SVM approaches such as KNN and Tree-Fitting were also evaluated to highlight a distinction between different types of classification methods. For these tests, we chose 16 datasets with different Imbalance Ratios (IR),

Conclusion

In this paper, we introduced the KerMinSVM classifier, a modification of the original SVM designed to solve the problem of learning from imbalanced datasets. As shown in the experimental section, KerMinSVM outperformed other techniques for learning from an imbalanced dataset. KerMinSVM has a higher sensitivity and F-measure than the standard SVM and the other techniques, and it does not sacrifice specificity. Moreover, KerMinSVM is computationally efficient, since it does not

Acknowledgments

This work is partly supported by the Qatar National Research Foundation (QNRF) and partly by the University Research Board at the American University of Beirut. The authors would like to thank Yara Rizk, Wissam Marrouche and Mohamad Kamareddine for their help in formatting the paper.

References (44)

  • Sun, Y., 2007. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit.
  • Abdulla, N., et al., 2013. Arabic sentiment analysis: Corpus-based and lexicon-based. Proceedings of The IEEE conference...
  • Ajeeb, N., Nayal, A., Awad, M., 2013. Minority SVM for linearly separable imbalanced datasets, In: Proceedings of the...
  • Akbani, R., et al., 2004. Applying support vector machines to imbalanced datasets. Mach. Learn. ECML 2004.
  • Alcalá, J., 2010. Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Value. Log. Soft Comput. 17 (2-3).
  • Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Al-Rajeh, M., Khorsheed, A., 2008. Automatic Arabic text...
  • Awad, M., Khanna, R., 2015. Efficient Learning Machines: Theories, Concepts, and Applications for Engineers and System...
  • Batuwita, R., Palade, V., 2010. FSVM-CIL: fuzzy support vector machines for class imbalance learning, in: IEEE...
  • Bollegala, D., Matsuo, Y., Ishizuka, M., 2007. Measuring semantic similarity between words using web search engines,...
  • Charalampopoulos, I., Anagnostopoulos, I., 2011. A comparable study employing weka clustering/classification algorithms...
  • Chawla, N.V., et al., 2011. SMOTE: synthetic minority over-sampling technique. arXiv Prepr. arXiv:1106.1813.
  • Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W., 2003. SMOTEBoost: Improving prediction of the minority class in...
  • Chen, M., Jin, X., Shen, D., 2011. Short text classification improved by learning multi-granularity topics, In:...
  • Cieslak, D.A., 2012. Hellinger distance decision trees are robust and skew-insensitive. Data Min. Knowl. Discov.
  • El-Halees, A., 2007. Arabic text classification using maximum entropy. Islam. Univ. J. (Ser. Nat. Stud. Eng.).
  • El Kourdi, M., Bensaid, A., Rachidi, T., 2004. Automatic Arabic document categorization based on the Naive Bayes...
  • Faguo, Z., Fan, Z., Bingru, Y., Xingang, Y., 2010. Research on short text classification algorithm based on statistics...
  • Gazzah, S., Amara, N., 2008. New oversampling approaches based on polynomial fitting for imbalanced data sets. In:...
  • Holmes, G., Donkin, A., Witten, I., 1994. Weka: A machine learning workbench. Intelligent Information Systems, 1994....
  • Hu, X., Sun, N., Zhang, C., Chua, T., 2009. Exploiting internal and external semantics for the clustering of short...
  • Imam, T., et al., 2006. z-svm: an svm for improved classification of imbalanced data. AI 2006: Adv. Artif. Intell.
  • Jomaa*, H., Kamereddine*, M., Nayal*, A., Rizk*, Y., Awad, M., 2016. Affective Relationship Between Color and Text in...