Latent semantic analysis for vector space expansion and fuzzy logic-based genetic clustering

  • Regular Paper
Knowledge and Information Systems

Abstract

This paper proposes an improved latent semantic analysis (LSA) model for representing textual documents and uses a fuzzy logic-based genetic algorithm (FLGA) for clustering. A standard genetic algorithm (GA) is difficult to apply in the conventional vector space model because the high-dimensional encoding forces the GA to search for the optimal solution in a complicated space, which is prone to overflow problems. The LSA-based corpus model not only reduces the number of dimensions drastically but also creates an underlying semantic structure, which improves its ability to distinguish documents in terms of concepts and indirectly improves the GA's clustering ability (genetic clustering). A novel FLGA is proposed in conjunction with this semantic model. Following the nature of biological evolution, several fuzzy controllers adaptively adjust and optimize the behavior of the GA, which effectively prevents premature convergence to a suboptimal solution. The experimental results show that the fuzzy logic controllers enhance the GA's ability to find the global optimum, and that applying the LSA-based text representation to the FLGA further improves its clustering performance.
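
The dimension reduction that the abstract attributes to LSA can be illustrated with a minimal sketch. It assumes a plain truncated singular value decomposition computed with NumPy; the function names `lsa_project` and `cosine`, the toy term-document matrix, and the choice `k=2` are hypothetical and are not taken from the paper. It does not reproduce the authors' improved LSA model or the FLGA.

```python
import numpy as np

def lsa_project(term_doc, k):
    """Project documents into a k-dimensional latent space (plain truncated SVD).

    term_doc: array of shape (n_terms, n_docs) holding term weights.
    Returns an array of shape (n_docs, k), one row per document.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the k largest singular values and their right singular vectors.
    return (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus: rows are terms, columns are documents.
A = np.array([
    [1, 1, 0, 0],  # "gene"
    [1, 0, 0, 0],  # "mutation"
    [0, 1, 0, 0],  # "crossover"
    [0, 0, 1, 1],  # "matrix"
    [0, 0, 1, 0],  # "vector"
    [0, 0, 0, 1],  # "singular"
], dtype=float)

docs = lsa_project(A, k=2)  # 6 term dimensions reduced to 2 latent dimensions

raw_sim = cosine(A[:, 0], A[:, 1])       # documents 0 and 1 in the term space
latent_sim = cosine(docs[0], docs[1])    # the same pair in the latent space
cross_sim = cosine(docs[0], docs[2])     # documents from different topics
print(raw_sim, latent_sim, cross_sim)
```

In this toy case the first two documents share only one of their two terms, so their similarity in the original term space is 0.5, while in the two-dimensional latent space they become nearly identical and stay nearly orthogonal to the documents on the other topic. A clustering algorithm such as a GA would then encode cluster centers in the `k`-dimensional space rather than in the full term space.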

Corresponding author

Correspondence to Wei Song.

About this article

Cite this article

Song, W., Park, S.C. Latent semantic analysis for vector space expansion and fuzzy logic-based genetic clustering. Knowl Inf Syst 22, 347–369 (2010). https://doi.org/10.1007/s10115-009-0191-5

  • DOI: https://doi.org/10.1007/s10115-009-0191-5
