Abstract
Text categorization is the task of assigning predefined classes to text documents based on their content. The k-NN algorithm is one of the top-performing classifiers on text data. However, there has been little research on the use of different voting methods for k-NN on text data. Moreover, when a large amount of training data is available online, response time increases, since the distance between a test document and every training sample must be computed. Min–max-modular k-NN (M3-k-NN), by contrast, has been applied to large-scale text categorization; it achieves good performance and responds faster in a parallel computing environment. In this paper, we investigate five different voting methods for k-NN and M3-k-NN. The experimental results and analysis show that the Gaussian voting method achieves the best performance among all voting methods for both k-NN and M3-k-NN. In addition, M3-k-NN needs a smaller value of k than k-NN to achieve better performance, and is thus faster than k-NN in a parallel computing environment.
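To make the notion of a voting method concrete, the following is a minimal sketch of k-NN with two voting schemes: plain majority voting and a distance-weighted Gaussian vote. It is an illustration under stated assumptions, not the formulation used in the paper: the function names, the choice of cosine distance, the weight exp(-d^2 / (2 sigma^2)), and the parameter sigma are all hypothetical.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train, query, k, voting="majority", sigma=0.5):
    """Classify `query` from its k nearest neighbours in `train`.

    train  : list of (vector, label) pairs
    voting : "majority" gives each neighbour one vote;
             "gaussian" weights each vote by exp(-d^2 / (2 * sigma^2)),
             an assumed form, not necessarily the paper's definition.
    """
    # Distance from the query to every training sample, nearest first.
    scored = [(cosine_distance(vec, query), label) for vec, label in train]
    neighbours = sorted(scored, key=lambda pair: pair[0])[:k]

    scores = defaultdict(float)
    for dist, label in neighbours:
        if voting == "majority":
            weight = 1.0
        elif voting == "gaussian":
            weight = math.exp(-(dist ** 2) / (2.0 * sigma ** 2))
        else:
            raise ValueError(f"unknown voting method: {voting}")
        scores[label] += weight

    # The class with the largest accumulated vote wins.
    return max(scores, key=scores.get)


# Toy data: one very close "tech" document, two distant "sports" documents.
train = [
    ([1.0, 0.0], "tech"),
    ([0.0, 1.0], "sports"),
    ([0.1, 1.0], "sports"),
]
query = [1.0, 0.0]

majority_label = knn_predict(train, query, k=3, voting="majority")
gaussian_label = knn_predict(train, query, k=3, voting="gaussian")
```

On this toy data the two schemes disagree: majority voting follows the two distant "sports" neighbours, while the Gaussian weights let the single very close "tech" neighbour dominate. This is the kind of difference a comparison of voting methods is meant to expose.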
Additional information
The work of K. Wu and B. L. Lu was supported in part by the National Natural Science Foundation of China under the grants NSFC 60375022 and NSFC 60473040, and the Microsoft Laboratory for Intelligent Computing and Intelligent Systems of Shanghai Jiao Tong University.
Cite this article
Wu, K., Lu, BL., Utiyama, M. et al. An empirical comparison of min–max-modular k-NN with different voting methods to large-scale text categorization. Soft Comput 12, 647–655 (2008). https://doi.org/10.1007/s00500-007-0242-3
DOI: https://doi.org/10.1007/s00500-007-0242-3