Abstract
This paper proposes an improved latent semantic analysis (LSA) model for representing textual documents and uses a fuzzy logic-based genetic algorithm (FLGA) for clustering. Applying a standard genetic algorithm (GA) to the conventional vector space model is difficult because the high-dimensional encoding forces the GA to search for the optimal solution in a complicated space, which is prone to overflow problems. The LSA-based corpus model not only reduces the number of dimensions drastically, but also creates an underlying semantic structure that enhances its ability to distinguish documents in terms of concepts and indirectly improves the ability of the GA to cluster them (genetic clustering). A novel FLGA is proposed in conjunction with this semantic model. Following the nature of biological evolution, several fuzzy controllers adaptively adjust and optimize the behavior of the GA, which effectively prevents premature convergence to a suboptimal solution. The experimental results show that the fuzzy logic controllers enhance the ability of the GA to find the global optimum, and that applying the LSA-based text representation to the FLGA further improves its clustering performance.
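The dimensionality reduction that the abstract attributes to LSA can be illustrated with a minimal sketch. It uses a plain truncated SVD in NumPy, not the improved LSA model proposed in the paper, and the toy term-document matrix, the function names `lsa_project` and `cosine`, and the choice `k=2` are all assumptions made for illustration.

```python
import numpy as np

def lsa_project(term_doc, k):
    """Map each document to k latent dimensions using a truncated SVD.

    term_doc has shape (n_terms, n_docs); the result has shape (n_docs, k).
    This is standard LSA, used here only as an illustrative stand-in.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values and scale the document coordinates.
    return (np.diag(s[:k]) @ vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus: rows are terms, columns are documents.
# Documents 0-1 share one vocabulary, documents 2-3 share another.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

docs_k = lsa_project(A, k=2)

# Compare similarity in the original term space and in the latent space.
print("term space  :", cosine(A[:, 0], A[:, 1]))
print("latent space:", cosine(docs_k[0], docs_k[1]))
print("across topics (latent):", cosine(docs_k[0], docs_k[2]))
```

In this toy case, documents that share a vocabulary move closer together in the reduced space while documents on different topics stay apart, which is the property a clustering algorithm such as a GA could then exploit with a much shorter encoding.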
Cite this article
Song, W., Park, S.C. Latent semantic analysis for vector space expansion and fuzzy logic-based genetic clustering. Knowl Inf Syst 22, 347–369 (2010). https://doi.org/10.1007/s10115-009-0191-5
DOI: https://doi.org/10.1007/s10115-009-0191-5