
Feature selection method based on multiple centrifuge models

Published in: Cluster Computing

Abstract

The high dimensionality of the feature space is a major problem in text classification, and feature selection is an effective way to reduce it. We propose a feature selection method based on multiple centrifuge models, built on the hypotheses that documents of the same class share a core feature set and that classes sharing the same high-frequency feature words have an affinity. The proposed algorithm introduces several ideas for feature reduction that raise the classification value of low-frequency features while preserving overall classification performance. Experiments on the Reuters-21578 corpus show that the method achieves better classification results and makes more effective use of medium- and low-frequency features that have strong discriminative power.
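The paper's full algorithm is not reproduced on this page, but the abstract's central idea — scoring features so that medium- and low-frequency terms concentrated in one class can outrank frequent but evenly spread terms — can be illustrated with a generic sketch. Everything below (the toy corpus, the `select_features` helper, and the concentration score) is a hypothetical illustration for the general technique, not the authors' actual method.

```python
from collections import Counter, defaultdict

def select_features(docs, labels, k):
    """Toy feature scorer: rank terms by how unevenly their occurrences
    are distributed across classes, so a rare term confined to one class
    can outrank a frequent but class-neutral term.
    docs: list of token lists; labels: parallel list of class labels."""
    class_counts = defaultdict(Counter)   # per-class term frequencies
    total = Counter()                     # corpus-wide term frequencies
    for tokens, label in zip(docs, labels):
        class_counts[label].update(tokens)
        total.update(tokens)
    scores = {}
    for term, tf in total.items():
        # Largest share of the term's occurrences held by a single class:
        # 1.0 means the term occurs in only one class (highly
        # discriminative); 1/|classes| means it is spread evenly.
        best = max(counts[term] for counts in class_counts.values())
        scores[term] = best / tf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

docs = [["oil", "price", "barrel"],
        ["oil", "export", "barrel"],
        ["wheat", "price", "harvest"],
        ["wheat", "crop", "harvest"]]
labels = ["crude", "crude", "grain", "grain"]
print(select_features(docs, labels, 4))
# The class-neutral term "price" is ranked below rarer but
# class-specific terms such as "export" and "crop".
```

Note how the low-frequency terms "export" and "crop" (one occurrence each) score higher than "price" (two occurrences): frequency alone does not decide retention, which mirrors the abstract's goal of keeping discriminative low-frequency features.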


Figs. 1–7


Acknowledgements

This work was financially supported by the National Natural Science Foundation of China (61373067, 61672301, 61662057), the Science and Technology Innovation Guide Project of the Inner Mongolia Autonomous Region of China (2016), the Open Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education (93K172016K05), the Science and Technology Research Program of Universities of the Inner Mongolia Autonomous Region of China (NJZY16177), the Philosophy and Social Science Planning Project of the Inner Mongolia Autonomous Region of China (2015D033), the Natural Science Foundation of the Inner Mongolia Autonomous Region of China (2016MS0624), and the Science and Technology Development Plan Program of Jilin Province (20140101195JC).

Author information


Corresponding author

Correspondence to Zhili Pei.


About this article


Cite this article

Wang, Q., Liu, L., Jiang, J. et al. Feature selection method based on multiple centrifuge models. Cluster Comput 20, 1425–1435 (2017). https://doi.org/10.1007/s10586-017-0812-9
