ABSTRACT
Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desired. Traditional techniques for this purpose fall into two categories: feature extraction and feature selection. Because of its efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to some criterion; moreover, their performance may deteriorate when the reserved data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on the 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1), and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI, with smaller computation time, especially when the reduced dimension is extremely small.
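The abstract describes selecting features by optimizing a centroid-based subspace objective over a discrete solution space. A minimal sketch of this idea, assuming the criterion reduces to ranking each feature by the class-size-weighted squared deviation of its class centroids from the global centroid (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def centroid_feature_scores(X, y):
    """Score each feature by the weighted squared deviation of its
    per-class centroids from the global centroid.

    X : (n_samples, n_features) term matrix
    y : (n_samples,) class labels
    """
    n = X.shape[0]
    m = X.mean(axis=0)                      # global centroid
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                # centroid of class c
        scores += (len(Xc) / n) * (mc - m) ** 2
    return scores

def select_features(X, y, k):
    """Keep the indices of the k highest-scoring features; in this
    sketch the discrete optimization reduces to a simple ranking."""
    return np.argsort(centroid_feature_scores(X, y))[::-1][:k]
```

Because the score is a single pass over class centroids, the cost is linear in the number of samples and features, which is consistent with the efficiency claim relative to greedy scorers such as IG and CHI.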
Index Terms
- OCFS: optimal orthogonal centroid feature selection for text categorization