Abstract
This article proposes a modified AHC (Agglomerative Hierarchical Clustering) algorithm that clusters string vectors instead of numerical vectors, as an approach to text clustering. Two facts motivate this research: string vector based algorithms were applied successfully to text clustering in previous works, and combining text clustering with word clustering is expected to produce a synergy effect. We define an operation on string vectors, called semantic similarity, and modify the AHC algorithm by adopting it as the similarity metric. The proposed AHC algorithm is empirically validated as a better approach to clustering texts drawn from news articles and opinions. As future work, more operations on string vectors need to be defined and characterized mathematically so that more advanced machine learning algorithms can be modified in the same way.
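To make the idea concrete, the following is a minimal sketch of agglomerative clustering over string vectors. It is not the article's implementation. The abstract does not give the definition of semantic similarity, so this sketch assumes a word-level similarity computed as the Dice overlap of the document sets in which two words occur, and a vector-level similarity taken as the mean of the position-wise word similarities. The names `word_docs`, `word_similarity`, `vector_similarity`, `cluster_similarity`, and `ahc`, the average-link merge rule, and the toy data are all hypothetical.

```python
from itertools import combinations


def word_similarity(a, b, word_docs):
    """Assumed word-level similarity: Dice overlap of the sets of
    document ids in which each word occurs (not taken from the article)."""
    if a == b:
        return 1.0
    docs_a, docs_b = word_docs.get(a, set()), word_docs.get(b, set())
    if not docs_a or not docs_b:
        return 0.0
    return 2 * len(docs_a & docs_b) / (len(docs_a) + len(docs_b))


def vector_similarity(u, v, word_docs):
    """Assumed semantic similarity of two equal-length string vectors:
    the mean of the position-wise word similarities."""
    return sum(word_similarity(a, b, word_docs) for a, b in zip(u, v)) / len(u)


def cluster_similarity(c1, c2, vectors, word_docs):
    """Average-link similarity between two clusters of vector indices."""
    total = sum(
        vector_similarity(vectors[i], vectors[j], word_docs)
        for i in c1
        for j in c2
    )
    return total / (len(c1) * len(c2))


def ahc(vectors, word_docs, n_clusters):
    """Start from singleton clusters and repeatedly merge the most
    similar pair until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(
                clusters[p[0]], clusters[p[1]], vectors, word_docs
            ),
        )
        # combinations yields a < b, so deleting index b leaves a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data (hypothetical): each word maps to the ids of the documents
# that contain it; each text is encoded as a two-word string vector.
word_docs = {
    "stock": {0, 1}, "market": {0, 1}, "bank": {0, 1},
    "goal": {2, 3}, "match": {2, 3}, "team": {2, 3},
}
vectors = [
    ("stock", "market"),
    ("market", "bank"),
    ("goal", "team"),
    ("team", "match"),
]
clusters = ahc(vectors, word_docs, n_clusters=2)
print(clusters)
```

The only change from a conventional AHC loop is that the pairwise measure operates on words rather than on numerical features, so no numerical encoding of the texts is required. Any other word-level similarity could be substituted for `word_similarity` without touching the merge loop.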
Acknowledgements
This work was supported by the 2019 Hongik University Research Fund.
About this article
Cite this article
Jo, T. Semantic string operation for specializing AHC algorithm for text clustering. Ann Math Artif Intell 88, 1083–1100 (2020). https://doi.org/10.1007/s10472-019-09687-x
DOI: https://doi.org/10.1007/s10472-019-09687-x