Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization

Du, Min; Chen, Xing-shu

doi:10.1631/jzus.C1200303

Published: 12 June 2013

Volume 14, pages 407–416, (2013)

Journal of Zhejiang University SCIENCE C

Abstract

Text categorization is an important technique for managing the surging volume of text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: points that are near each other in space usually remain near when projected onto a direction, so points that are distant along the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be discarded when searching for the k nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. It is particularly well suited to applications with large, high-dimensional datasets.
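
The pruning principle described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details are in the full text and are not reproduced here. The sketch assumes Euclidean distance, uses only the first principal direction (estimated by power iteration), and prunes exactly: for a unit direction v, the gap between two projections never exceeds the distance between the original points, so a candidate whose gap already exceeds the current k-th best distance can be skipped. Because the abstract reports a small loss of accuracy, the published method presumably prunes more aggressively than this exact variant. The names principal_direction, build_index and knn_pruned are hypothetical.

```python
import bisect
import heapq
import math
import random


def principal_direction(points, iters=200, seed=0):
    """Estimate the first principal direction by power iteration (unit vector)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        # w = X^T (X v): the (unscaled) covariance matrix applied to v
        scores = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
        w = [sum(s * c[j] for s, c in zip(scores, centered)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data (all points identical)
            break
        v = [x / norm for x in w]
    return v


def build_index(points):
    """Project every point onto the direction and sort by projection value."""
    v = principal_direction(points)
    order = sorted((sum(a * b for a, b in zip(p, v)), i)
                   for i, p in enumerate(points))
    keys = [proj for proj, _ in order]
    ids = [i for _, i in order]
    return v, keys, ids


def knn_pruned(points, index, query, k):
    """Return ([(distance, point_index), ...], number_of_distances_computed)."""
    if k <= 0:
        return [], 0
    v, keys, ids = index
    q_proj = sum(a * b for a, b in zip(query, v))
    hi = bisect.bisect_left(keys, q_proj)  # first projection >= q_proj
    lo = hi - 1
    best = []      # max-heap of the k best so far, stored as (-distance, id)
    examined = 0
    while lo >= 0 or hi < len(keys):
        gap_lo = q_proj - keys[lo] if lo >= 0 else math.inf
        gap_hi = keys[hi] - q_proj if hi < len(keys) else math.inf
        # Always visit the side whose projection is closer to the query,
        # so the gaps are visited in non-decreasing order.
        if gap_lo <= gap_hi:
            gap, pos = gap_lo, lo
            lo -= 1
        else:
            gap, pos = gap_hi, hi
            hi += 1
        # Projection gap is a lower bound on the true distance: once it
        # exceeds the k-th best distance, no remaining point can qualify.
        if len(best) == k and gap > -best[0][0]:
            break
        d = math.dist(points[ids[pos]], query)
        examined += 1
        if len(best) < k:
            heapq.heappush(best, (-d, ids[pos]))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, ids[pos]))
    return sorted((-nd, i) for nd, i in best), examined
```

A call such as knn_pruned(points, build_index(points), query, 5) returns the five nearest neighbors together with the number of full distance computations performed. How much is saved depends on the data: pruning is strongest when most of the variance lies along the chosen direction, and weak when the data are nearly isotropic.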

Author information

Authors and Affiliations

School of Computer Science, Sichuan University, Chengdu, 610065, China
Min Du & Xing-shu Chen

Corresponding author

Correspondence to Xing-shu Chen.

Additional information

Project (No. 2012BAH18B05) supported by the National Key Technology R&D Program of China

About this article

Cite this article

Du, M., Chen, Xs. Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization. J. Zhejiang Univ. - Sci. C 14, 407–416 (2013). https://doi.org/10.1631/jzus.C1200303

Received: 24 October 2012
Accepted: 01 April 2013
Published: 12 June 2013
Issue Date: June 2013
DOI: https://doi.org/10.1631/jzus.C1200303

Key words

CLC number

TP391
