ABSTRACT
Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desired. Traditional techniques for this purpose fall into two categories: feature extraction and feature selection. Because of its efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to some criterion; moreover, their performance may deteriorate when the reserved data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on the 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1), and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI, with smaller computation time, especially when the reduced dimension is extremely small.
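The abstract describes selecting features by optimizing a centroid-based subspace objective over a discrete solution space. A minimal sketch of this idea, assuming the criterion reduces to ranking each feature by the class-size-weighted squared deviation of its class centroids from the global centroid (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def centroid_feature_scores(X, y):
    """Score each feature by the weighted squared deviation of its
    per-class centroids from the global centroid.

    X : (n_samples, n_features) term matrix
    y : (n_samples,) class labels
    """
    n = X.shape[0]
    m = X.mean(axis=0)                      # global centroid
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                # centroid of class c
        scores += (len(Xc) / n) * (mc - m) ** 2
    return scores

def select_features(X, y, k):
    """Keep the indices of the k highest-scoring features; in this
    sketch the discrete optimization reduces to a simple ranking."""
    return np.argsort(centroid_feature_scores(X, y))[::-1][:k]
```

Because the score is a single pass over class centroids, the cost is linear in the number of samples and features, which is consistent with the efficiency claim relative to greedy scorers such as IG and CHI.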
Index Terms
- OCFS: optimal orthogonal centroid feature selection for text categorization