An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization


Abstract

This paper presents a new text document classification framework that uses the Support Vector Machine (SVM) approach in the training phase and the Euclidean distance function in the classification phase, which we call Euclidean-SVM. The SVM constructs a classifier by generating a decision surface, the optimal separating hyper-plane, to partition different categories of data points in the vector space. The concept of the optimal separating hyper-plane can be generalized to non-linearly separable cases by introducing kernel functions that map the data points from the input space into a high-dimensional feature space, where they can be separated by a linear hyper-plane. As a result, the choice of kernel function has a strong impact on the classification accuracy of the SVM. Besides the kernel function, the value of the soft-margin parameter C is another critical factor in the performance of the SVM classifier. Hence, a critical problem of the conventional SVM classification framework is the need to determine an appropriate kernel function and an appropriate value of C for each dataset, whose characteristics may vary, in order to guarantee high classification accuracy. In this paper, we introduce a distance measurement technique that uses the Euclidean distance function, instead of the optimal separating hyper-plane, as the classification decision function of the SVM. In our approach, the support vectors of each category are identified from the training data points during the training phase using the SVM. In the classification phase, when a new data point is mapped into the original vector space, the average distance between the new data point and the support vectors of each category is measured using the Euclidean distance function. The new data point is assigned to the category whose support vectors have the lowest average distance to it, so the classification decision does not depend on the quality of the hyper-plane formed by a particular kernel function and soft-margin parameter. We tested the proposed framework on several text datasets. The experimental results show that the accuracy of the Euclidean-SVM text classifier is only weakly affected by the choice of kernel function and of the soft-margin parameter C.
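
The decision rule described above assigns a new document x to the category c whose support vectors SV_c give the smallest average Euclidean distance, i.e. argmin over c of (1/|SV_c|) * sum over s in SV_c of ||x - s||. The following sketch illustrates the two-phase procedure on dense document feature vectors (for example TF-IDF), using scikit-learn's SVC as an assumed stand-in for the SVM training stage; it is an illustrative reconstruction from the abstract, not the authors' implementation, and the helper names are hypothetical.

# A minimal sketch (not the authors' implementation) of Euclidean-SVM:
# the SVM is used only to identify support vectors, and the final
# decision is made by average Euclidean distance in the input space.
import numpy as np
from sklearn.svm import SVC

def train_support_vectors(X, y, kernel="linear", C=1.0):
    """Training phase: fit an SVM and group its support vectors by category."""
    svm = SVC(kernel=kernel, C=C)
    svm.fit(X, y)
    sv = svm.support_vectors_                # support vectors in the input space
    sv_labels = np.asarray(y)[svm.support_]  # category of each support vector
    return {c: sv[sv_labels == c] for c in np.unique(sv_labels)}

def classify(x, support_vectors_by_category):
    """Classification phase: choose the category whose support vectors are
    closest to x on average, measured with the Euclidean distance."""
    avg_distance = {
        c: np.mean(np.linalg.norm(vectors - x, axis=1))
        for c, vectors in support_vectors_by_category.items()
    }
    return min(avg_distance, key=avg_distance.get)

In this sketch, changing the kernel or C only alters which training points are retained as support vectors, not the distance-based decision function itself, which mirrors the abstract's claim that the Euclidean-SVM classifier is only weakly affected by those settings.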

Corresponding author

Correspondence to Lam Hong Lee.


Cite this article

Lee, L.H., Wan, C.H., Rajkumar, R. et al. An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization. Appl Intell 37, 80–99 (2012). https://doi.org/10.1007/s10489-011-0314-z
