An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization

Lee, Lam Hong; Wan, Chin Heng; Rajkumar, Rajprasad; Isa, Dino

doi:10.1007/s10489-011-0314-z

Published: 25 August 2011

Applied Intelligence, Volume 37, pages 80–99 (2012)

Lam Hong Lee¹, Chin Heng Wan¹, Rajprasad Rajkumar² & Dino Isa²

Abstract

This paper presents a new text document classification framework, termed Euclidean-SVM, that uses the Support Vector Machine (SVM) approach in the training phase and the Euclidean distance function in the classification phase. The SVM constructs a classifier by generating a decision surface, namely the optimal separating hyper-plane, that partitions data points of different categories in the vector space. The concept of the optimal separating hyper-plane can be generalized to non-linearly separable cases by introducing kernel functions that map the data points from the input space into a high-dimensional feature space, where they can be separated by a linear hyper-plane. As a result, the choice of kernel function has a high impact on the classification accuracy of the SVM. Besides the kernel function, the value of the soft margin parameter C is another critical factor in the performance of the SVM classifier. Hence, one of the critical problems of the conventional SVM classification framework is the need to determine the appropriate kernel function and the appropriate value of C for datasets of varying characteristics in order to ensure high accuracy of the classifier. In this paper, we introduce a distance measurement technique that uses the Euclidean distance function to replace the optimal separating hyper-plane as the classification decision function in the SVM. In our approach, the support vectors for each category are identified from the training data points during the training phase using the SVM. In the classification phase, when a new data point is mapped into the original vector space, the average distances between the new data point and the support vectors of the different categories are measured using the Euclidean distance function. The new data point is assigned to the category whose support vectors have the lowest average distance to it, which makes the classification decision independent of the efficacy of the hyper-plane formed by a particular kernel function and soft margin parameter. We tested the proposed framework on several text datasets. The experimental results show that, with this approach, the accuracy of the Euclidean-SVM text classifier depends only weakly on the choice of kernel function and soft margin parameter C.
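
The classification phase described in the abstract can be illustrated with a minimal Python sketch. It assumes the support vectors of each category have already been identified by a trained SVM (the training phase is not shown). The function name, the category labels, and the two-dimensional vectors are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from statistics import mean


def classify_by_average_distance(point, support_vectors_by_category):
    """Assign `point` to the category whose support vectors have the
    lowest average Euclidean distance to it.

    `support_vectors_by_category` maps a category label to a non-empty
    list of support vectors (assumed to come from a trained SVM).
    Returns the chosen label and the per-category average distances.
    """
    average_distance = {
        category: mean(math.dist(point, sv) for sv in vectors)
        for category, vectors in support_vectors_by_category.items()
    }
    # The decision rule: smallest average distance wins.
    best = min(average_distance, key=average_distance.get)
    return best, average_distance


# Hypothetical support vectors for two categories (illustrative values only).
support_vectors = {
    "sports": [(0.0, 0.0), (0.0, 2.0)],
    "politics": [(4.0, 0.0), (4.0, 2.0)],
}

label, distances = classify_by_average_distance((1.0, 1.0), support_vectors)
print(label, distances)
```

In this sketch the kernel function and the parameter C would influence only which training points become support vectors; the decision itself uses plain Euclidean distances in the original vector space, which is the property the abstract attributes to the framework.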

Author information

Authors and Affiliations

Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman, Bandar Barat, 31900, Kampar, Perak, Malaysia
Lam Hong Lee & Chin Heng Wan
Intelligent Systems Research Group, Faculty of Engineering, The University of Nottingham, Malaysia Campus, Jalan Broga, 43500, Semenyih, Selangor, Malaysia
Rajprasad Rajkumar & Dino Isa

Corresponding author

Correspondence to Lam Hong Lee.

About this article

Cite this article

Lee, L.H., Wan, C.H., Rajkumar, R. et al. An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization. Appl Intell 37, 80–99 (2012). https://doi.org/10.1007/s10489-011-0314-z

Published: 25 August 2011
Issue Date: July 2012
DOI: https://doi.org/10.1007/s10489-011-0314-z
