Abstract
This paper presents the implementation of a new text document classification framework, termed Euclidean-SVM, that uses the Support Vector Machine (SVM) approach in the training phase and the Euclidean distance function in the classification phase. The SVM constructs a classifier by generating a decision surface, namely the optimal separating hyper-plane, to partition different categories of data points in the vector space. The concept of the optimal separating hyper-plane can be generalized to non-linearly separable cases by introducing kernel functions that map the data points from the input space into a high-dimensional feature space, where they can be separated by a linear hyper-plane. As a result, the choice of kernel function has a strong impact on the classification accuracy of the SVM. Besides the kernel function, the value of the soft margin parameter C is another critical factor in the performance of the SVM classifier. Hence, one of the critical problems of the conventional SVM classification framework is the need to determine the appropriate kernel function and the appropriate value of C for datasets of varying characteristics in order to ensure high classifier accuracy. In this paper, we introduce a distance measurement technique that uses the Euclidean distance function to replace the optimal separating hyper-plane as the classification decision function in the SVM. In our approach, the support vectors for each category are identified from the training data points during the training phase using the SVM. In the classification phase, when a new data point is mapped into the original vector space, the average distances between the new data point and the support vectors of the different categories are measured using the Euclidean distance function. The classification decision is made based on the category of support vectors that has the lowest average distance to the new data point, which makes the decision independent of the efficacy of the hyper-plane formed by the particular kernel function and soft margin parameter. We tested our proposed framework on several text datasets. The experimental results show that, with this approach, the accuracy of the Euclidean-SVM text classifier is only weakly affected by the choice of kernel function and soft margin parameter C.
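The classification phase described in the abstract can be illustrated with a minimal sketch. It assumes the support vectors of each category have already been identified by SVM training; the function names, category labels, and two-dimensional vectors below are hypothetical and chosen only for illustration, and this is not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def average_distance(x: Vector, support_vectors: List[Vector]) -> float:
    """Mean Euclidean distance from x to one category's support vectors."""
    return sum(math.dist(x, sv) for sv in support_vectors) / len(support_vectors)


def classify(x: Vector, support_vectors_by_category: Dict[str, List[Vector]]) -> str:
    """Return the category whose support vectors are closest to x on average."""
    return min(
        support_vectors_by_category,
        key=lambda category: average_distance(x, support_vectors_by_category[category]),
    )


# Hypothetical support vectors, as if already identified during SVM training.
SUPPORT_VECTORS = {
    "sports": [(0.0, 0.0), (0.0, 2.0)],
    "finance": [(4.0, 0.0), (4.0, 2.0)],
}

print(classify((1.0, 1.0), SUPPORT_VECTORS))
```

In this sketch the kernel function and C would influence only which training points are retained as support vectors; the decision itself never evaluates the hyper-plane.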
Lee, L.H., Wan, C.H., Rajkumar, R. et al. An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization. Appl Intell 37, 80–99 (2012). https://doi.org/10.1007/s10489-011-0314-z
DOI: https://doi.org/10.1007/s10489-011-0314-z