A new unsupervised feature selection method for text clustering based on genetic algorithms


Abstract

Nowadays a vast amount of textual information is collected and stored in various databases around the world, with the Internet as the largest database of all. This rapid growth of published text means that even the most avid reader cannot hope to keep up with all the reading in a field, and consequently nuggets of insight or new knowledge are at risk of languishing undiscovered in the literature. Text mining offers a solution to this problem by replacing or supplementing the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information. Text clustering is one of the most important areas in text mining; it includes text preprocessing, dimension reduction by selecting some terms (features) and, finally, clustering using the selected terms. Feature selection appears to be the most important step in the process. Conventional unsupervised feature selection methods define a measure of the discriminating power of terms in order to select proper terms from a corpus. However, the evaluation of terms in groups has not yet been investigated in reported works. In this paper a new and robust unsupervised feature selection approach is proposed that evaluates terms in groups. In addition, a new Modified Term Variance measure is proposed for evaluating groups of terms. Furthermore, a genetic-based algorithm is designed and implemented for finding the most valuable groups of terms based on the new measure. These terms are then used to generate the final feature vector for the clustering process. To evaluate and justify our approach, the proposed method and a conventional term variance method are implemented and tested on the Reuters-21578 collection. For a more accurate comparison, the methods have been tested on three corpora, and for each corpus the clustering task has been run ten times and the results averaged. The comparison of the two methods is very promising and shows that our method produces better average accuracy and F1-measure than the conventional term variance method.
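The abstract contrasts conventional per-term scoring (term variance) with the proposed group-wise, genetic-algorithm-driven selection. The sketch below illustrates both ideas in Python under stated assumptions: term_variance implements the standard term variance (TV) score, while ga_select_group evolves fixed-size groups of term indices using a placeholder group fitness (the mean TV of the group) as a stand-in for the paper's Modified Term Variance, whose exact formula is not given in the abstract. Function names, GA parameters, and the synthetic document-term matrix are illustrative only, not the authors' implementation.

# Minimal sketch: per-term TV baseline vs. a toy GA over groups of terms.
# The group fitness is an ASSUMED surrogate for the paper's Modified Term Variance.

import random
import numpy as np

def term_variance(tf):
    """Classical TV score per term: sum over documents of (f_td - mean_t)^2,
    where tf is a (documents x terms) term-frequency matrix."""
    return ((tf - tf.mean(axis=0)) ** 2).sum(axis=0)

def select_top_k(tf, k):
    """Conventional baseline: rank terms individually by TV and keep the k highest."""
    return np.argsort(term_variance(tf))[::-1][:k]

def ga_select_group(tf, k, pop_size=30, generations=50,
                    crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Toy GA over fixed-size term groups. Each chromosome is a set of k term
    indices; fitness is the mean TV of the group (assumed placeholder)."""
    rng = random.Random(seed)
    n_terms = tf.shape[1]
    tv = term_variance(tf)

    def fitness(group):
        return tv[list(group)].mean()

    def random_group():
        return set(rng.sample(range(n_terms), k))

    population = [random_group() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        next_pop = scored[:2]                                # elitism: keep the best two
        while len(next_pop) < pop_size:
            a, b = rng.sample(scored[:pop_size // 2], 2)     # pick parents from the better half
            child = set(a)
            if rng.random() < crossover_rate:                # uniform-style crossover on the union
                pool = list(a | b)
                child = set(rng.sample(pool, min(k, len(pool))))
            if rng.random() < mutation_rate:                 # mutation: drop one term at random
                child.discard(rng.choice(list(child)))
            while len(child) < k:                            # repair back to exactly k terms
                child.add(rng.randrange(n_terms))
            next_pop.append(child)
        population = next_pop
    best = max(population, key=fitness)
    return np.array(sorted(best))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tf = rng.poisson(1.5, size=(100, 200)).astype(float)     # synthetic doc-term counts
    print("top-10 terms by TV:", select_top_k(tf, 10))
    print("GA-selected group: ", ga_select_group(tf, 10))

Keeping the group size k fixed and repairing children back to k terms keeps the GA's search space directly comparable to the top-k TV baseline; in the paper the group score would be the Modified Term Variance rather than the mean TV used here.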


References

  • Bao, J., Shen, J., Liu, X., & Song, Q. (2003). A new text feature extraction model and its application in document copy detection. In Proceedings of the second international conference on machine learning and cybernetics, IEEE (pp. 82–87).

  • Basu, A., Watters, C., & Shepherd, M. (2002). Support vector machines for text categorization. In Proceedings of the 36th Hawaii international conference on system sciences, IEEE.

  • Beasley, D., Bull, D., & Martin, R. (1993). A sequential niche technique for multimodal function optimization. Evolutionary Computation Journal, 101–105.

  • Buddeewong, S., & Worapoj, K. (2005). A new association rule-based text classifier algorithm. In Proceedings of the 17th IEEE international conference on tools with artificial intelligence (pp. 684–685).

  • Coley, D. (1999). An introduction to genetic algorithms for scientists and engineers. World Scientific.

  • Goldberg, D. (1989). Genetic algorithms in search, optimization and machine learning. Kluwer Academic Publishers.

  • Hung, C., & Wermter, S. (2003). A dynamic adaptive self-organising hybrid model for text clustering. In Proceedings of the third IEEE international conference on data mining (ICDM’03) (pp. 75–83).

  • Jain, G., Ginwala, A., & Aslandogan, Y. (2004). An approach to text classification using dimensionality reduction and combination of classifiers. In Proceedings of the IEEE international conference on information reuse and integration (IRI) (pp. 564–569).

  • Kuntraruk, J., & Pottenger, M. (2001). Massively parallel distributed feature extraction in textual data mining using HDDI. In Proceedings of the 10th IEEE international symposium on high performance distributed computing (pp. 363–370).

  • Lee, C., Yang, H., & Ma, S. (2006). A novel multilingual text categorization system using latent semantic indexing. In Proceedings of the first international conference on innovative computing, information and control, IEEE.

  • Liu, L., Kang, J., Yu, J., & Wang, Z. (2005). A comparative study on unsupervised feature selection methods for text clustering. In Proceedings of NLP-KE (Vol. 9, pp. 597–601).

  • Massey, L. (2005). Real-world text clustering with adaptive resonance theory neural networks. In Proceedings of IEEE international joint conference on neural networks (IJCNN’05) (Vol. 5, pp. 2748–2753).

  • Miller, T. (2005). Data and text mining: A business applications approach. New York: Prentice Hall.

  • Mitchell, T. (1997). Machine learning. Washington: McGraw-Hill.

  • Porter algorithm (2007). http://tartarus.org/~martin/PorterStemmer.

  • Prabowo, R., & Thelwall, M. (2006). A comparison of feature selection methods for an evolving RSS feed corpus. Information Processing and Management, 42, 1491–1512.

  • Reuters-21578 text collection (2007). http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.

  • Shang, W., Qu, Y., Haiban, Z., Houkuan, H., Yongmin, L., & Hongbin, D. (2006). An adaptive fuzzy knn text classifier based on gini index weight. In Proceedings of the 11th IEEE international symposium on computers and communications (ISCC’06).

  • Song, W., & Park, S. (2006). Genetic algorithm-based text clustering technique: Automatic evolution of clusters with high efficiency. In Proceedings of the seventh international conference on Web-Age information management workshops (WAIMW) (pp. 17–25).

  • Song, W., & Park, S. (2009). Genetic algorithm for text clustering based on latent semantic indexing. Computers & Mathematics with Applications, 57(11–12), 1901–1907.

  • Sullivan, D. (2001). Document warehousing and text mining. New York: Wiley.

  • Sun, F., & Sun, M. (2005). A new transductive support vector machine approach to text categorization. In Proceedings of NLP-KE, IEEE (pp. 631–635).

  • Wang, B., & Zhang, S. (2005). A novel text classification algorithm based on naïve bayes and KL-divergence. In Proceedings of the sixth international conference on parallel and distributed computing, applications and technologies (PDCAT’05), IEEE.

  • Xu, J., & Wang, Z. (2004). A new method of text categorization based on PA and Kohonen network. In Proceedings of the third international conference on machine learning and cybernetics, Shanghai (pp. 1324–1328).

  • Yang, Y., & Pedersen, J. (1997). A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th international conference on machine learning (pp. 412–420).

  • Yang, S., Wu, X., Deng, Z., Zhang, M., & Yang, D. (2002). Relative term-frequency based feature selection for text categorization. In Proceedings of the first international conference on machine learning and cybernetics, Beijing, IEEE (pp. 1432–1436).

Acknowledgement

We thank Pete Blindell for proofreading the manuscript.

Author information

Corresponding author

Correspondence to Mohammad Saraee.

About this article

Cite this article

Shamsinejadbabki, P., Saraee, M. A new unsupervised feature selection method for text clustering based on genetic algorithms. J Intell Inf Syst 38, 669–684 (2012). https://doi.org/10.1007/s10844-011-0172-5
