DOI: 10.1145/1076034.1076058
Article

OCFS: optimal orthogonal centroid feature selection for text categorization

Published: 15 August 2005

ABSTRACT

Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desired. Traditional techniques for this purpose can generally be classified into feature extraction and feature selection. Because of its efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to some criterion; moreover, their performance may deteriorate when the reserved data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on the 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1) and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI, with smaller computation time, especially when the reduced dimension is extremely small.
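A centroid-based feature selection of the kind the abstract describes can be sketched as follows: each feature is scored by how far the class centroids lie from the overall centroid along that feature, weighted by class size, and the top-k features are kept in one pass rather than greedily. This is a minimal illustrative sketch, not the paper's implementation; function names and the exact weighting are assumptions.

```python
import numpy as np

def centroid_feature_scores(X, y):
    """Score each feature by a weighted sum of squared distances between
    class centroids and the overall centroid, computed per feature.
    (Illustrative sketch of a centroid-based criterion; notation is mine.)"""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    overall = X.mean(axis=0)                  # overall centroid
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        class_centroid = Xc.mean(axis=0)      # centroid of class c
        # weight each class by its relative size
        scores += (len(Xc) / n) * (class_centroid - overall) ** 2
    return scores

def select_top_k(X, y, k):
    """Keep the k highest-scoring features; since scores are computed
    independently per feature, the top-k set maximizes the total score
    without a greedy search."""
    return np.argsort(centroid_feature_scores(X, y))[::-1][:k]
```

For example, on a toy two-class dataset where the first feature separates the classes and the second is noise, `select_top_k(X, y, 1)` returns the index of the discriminative feature.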


    • Published in

      cover image ACM Conferences
      SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
      August 2005
      708 pages
      ISBN:1595930345
      DOI:10.1145/1076034

      Copyright © 2005 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
