Semantic string operation for specializing AHC algorithm for text clustering

Annals of Mathematics and Artificial Intelligence

Abstract

This article proposes a modified AHC (Agglomerative Hierarchical Clustering) algorithm that clusters string vectors, instead of numerical vectors, as an approach to text clustering. Two facts motivate this research: string vector based algorithms were applied successfully to text clustering in previous works, and a synergy effect is expected from combining text clustering with word clustering. We define an operation on string vectors called semantic similarity, and modify the AHC algorithm by adopting it as the similarity metric. The proposed AHC algorithm is empirically validated as the better approach to clustering texts in news articles and opinions. More operations on string vectors need to be defined and characterized mathematically in order to modify more advanced machine learning algorithms.
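The idea of running AHC directly on string vectors can be illustrated with a minimal sketch. It is not the article's implementation: the abstract does not give the definition of semantic similarity, so the sketch assumes a hypothetical word-to-word similarity table (`WORD_SIM`, with made-up values) and takes the similarity between two equal-length string vectors to be the average of the position-wise word similarities. Average-link merging and all function names are likewise assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical word-to-word similarities (symmetric, made-up values).
# In practice these would be estimated from a corpus.
WORD_SIM = {
    frozenset(("stock", "market")): 0.8,
    frozenset(("movie", "film")): 0.9,
}

def word_similarity(a, b):
    """Similarity between two words: 1.0 if identical, table lookup otherwise."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset((a, b)), 0.0)

def semantic_similarity(u, v):
    """Assumed operation: mean position-wise word similarity of two string vectors."""
    if len(u) != len(v):
        raise ValueError("string vectors must have the same dimension")
    return sum(word_similarity(a, b) for a, b in zip(u, v)) / len(u)

def cluster_similarity(c1, c2, vectors):
    """Average-link similarity between two clusters of vector indices."""
    total = sum(semantic_similarity(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def ahc(vectors, num_clusters):
    """Merge the most similar pair of clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

# Each text is encoded as a string vector (a fixed-size tuple of words).
texts = [
    ("stock", "market", "bank"),
    ("market", "stock", "bank"),
    ("movie", "actor", "film"),
    ("film", "actor", "movie"),
]
print(ahc(texts, 2))
```

With this toy table the two finance-like vectors and the two film-like vectors end up in separate clusters. The only change relative to a conventional AHC is the similarity function: no numerical encoding of the texts is required.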

Acknowledgements

This work was supported by the 2019 Hongik University Research Fund.

Author information

Corresponding author

Correspondence to Taeho Jo.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jo, T. Semantic string operation for specializing AHC algorithm for text clustering. Ann Math Artif Intell 88, 1083–1100 (2020). https://doi.org/10.1007/s10472-019-09687-x

  • DOI: https://doi.org/10.1007/s10472-019-09687-x
