Text Similarity Computing Based on Standard Deviation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3644)

Abstract

Automatic text categorization is the task of assigning free-text documents to one or more predefined categories based on their content. The classical method for computing text similarity is to calculate the cosine of the angle between the document vectors. To improve categorization performance, this paper puts forward a new algorithm that computes text similarity based on standard deviation. Experiments on Chinese text documents show the validity and feasibility of the standard deviation-based algorithm.
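
As an illustration of the classical baseline mentioned in the abstract, the following is a minimal sketch of cosine similarity between two term-frequency vectors. It is not the paper's standard deviation-based algorithm, which the abstract does not specify. The function names `term_vector` and `cosine_similarity` are hypothetical, and whitespace tokenization is an assumption made for brevity; Chinese text would first need a word segmenter.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (assumes whitespace-separated tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Return the cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Convention chosen here: an empty document is similar to nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = term_vector("stock market prices fall")
    doc2 = term_vector("stock market prices rise")
    print(cosine_similarity(doc1, doc2))
```

In a categorization setting, a score like this would typically be computed between a new document and each category's representative vector, with the document assigned to the highest-scoring category or categories.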

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, T., Guo, J. (2005). Text Similarity Computing Based on Standard Deviation. In: Huang, DS., Zhang, XP., Huang, GB. (eds) Advances in Intelligent Computing. ICIC 2005. Lecture Notes in Computer Science, vol 3644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11538059_48

  • DOI: https://doi.org/10.1007/11538059_48

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28226-6

  • Online ISBN: 978-3-540-31902-3

  • eBook Packages: Computer Science, Computer Science (R0)
