Automatic document classification based on latent semantic analysis

Programming and Computer Software

Abstract

This paper considers the problem of automatically classifying documents into a given set of topics. The proposed method uses latent semantic analysis to extract semantic dependencies between words, and documents are classified on the basis of these dependencies. Experiments on the standard TREC (Text REtrieval Conference) test collection confirm the attractiveness of this approach. The relatively low computational complexity of the method at the classification stage makes it applicable to the classification of document streams.
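
The abstract does not spell out the classification rule, so the following is only a minimal sketch of one common way to classify with latent semantic analysis, not the authors' method. It assumes raw term counts, a truncated SVD of the term-document matrix, and assignment to the topic whose centroid is closest by cosine similarity in the latent space. The function names (`term_vector`, `train`, `predict`), the toy documents, and the choice `k=2` are all hypothetical.

```python
import numpy as np

def term_vector(text, index):
    """Raw term-frequency vector of `text` over a fixed vocabulary index."""
    vec = np.zeros(len(index))
    for token in text.lower().split():
        if token in index:  # out-of-vocabulary words are ignored
            vec[index[token]] += 1.0
    return vec

def train(docs, labels, k):
    """Fit a rank-k latent space and one centroid per topic (illustrative only)."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    # Term-document matrix: rows are terms, columns are training documents.
    A = np.column_stack([term_vector(d, index) for d in docs])
    # Truncated SVD: keep the k largest singular values and their term vectors.
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    # Project every training document into the latent space: Sigma_k^-1 U_k^T a.
    projected = (U_k.T @ A) / s_k[:, None]  # shape (k, n_docs)
    centroids = {}
    for topic in sorted(set(labels)):
        cols = [j for j, lab in enumerate(labels) if lab == topic]
        centroids[topic] = projected[:, cols].mean(axis=1)
    return {"index": index, "U_k": U_k, "s_k": s_k, "centroids": centroids}

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict(model, text):
    """Fold a new document into the latent space and pick the nearest centroid."""
    q = term_vector(text, model["index"])
    q_hat = (model["U_k"].T @ q) / model["s_k"]
    return max(model["centroids"],
               key=lambda topic: cosine(q_hat, model["centroids"][topic]))

# Toy training set (assumed data, two topics).
train_docs = [
    "stock market shares trading",
    "market prices shares investors",
    "investors trading stock prices",
    "football match goal team",
    "team players match league",
    "league goal football players",
]
train_labels = ["finance"] * 3 + ["sport"] * 3
model = train(train_docs, train_labels, k=2)

print(predict(model, "shares investors market"))
print(predict(model, "players goal league"))
```

In this sketch the SVD is computed once, at training time. Classifying a new document then costs one projection onto `k` dimensions plus one cosine per topic, which is consistent with the low classification-stage cost that the abstract attributes to the method.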

Author information

Corresponding author

Correspondence to I. Kuralenok.

About this article

Cite this article

Kuralenok, I., Nekrest'yanov, I. Automatic document classification based on latent semantic analysis. Program Comput Soft 26, 199–206 (2000). https://doi.org/10.1007/BF02759469

  • DOI: https://doi.org/10.1007/BF02759469
