Abstract
Short text clustering is an unsupervised learning technique for pattern discovery and analysis of short text datasets, which has been applied to many scenarios such as business risk control and audit. With the development of digitalization over the last few years, the data scale in various scenarios has increased rapidly. Traditional short text clustering methods such as K-means face many challenges in large-scale data analysis, such as difficult to preset hyperparameters and high computational complexity. To alleviate this problem, we propose a novel clustering algorithm called Word Hash clustering algorithm (W-Hash) for Chinese short text analysis. Specifically, W-Hash does not require a pre-specified number of clusters, and it has much lower computational complexity than the traditional clustering approaches. To verify the effectiveness of W-Hash, we apply it to solve a real-life business audit problem. The corresponding experimental results show that W-Hash outperforms traditional clustering algorithms in both training time and result rationality.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Sharma, K.K., Seal, A.: Clustering analysis using an adaptive fused distance. Eng. Appl. Artif. Intell. 96, 103928 (2020)
Tian, T., Zhang, J., Lin, X., Wei, Z., Hakonarson, H.: Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data. Nat. Commun. 12(1), 1–12 (2021)
Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Ann. Data Sci. 2(2), 165–193 (2015). https://doi.org/10.1007/s40745-015-0040-1
Ahmed, M., Seraj, R., Islam, S.M.S.: The k-means algorithm: a comprehensive survey and performance evaluation. Electronics 9(8), 1295 (2020)
Khan, K., Rehman, S.U., Aziz, K., Fong, S., Sarasvady, S.: DBSCAN: past, present and future. In: The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014), pp. 232–238, February 2014
Jia, C., Carson, M.B., Wang, X., Yu, J.: Concept decompositions for short text clustering by identifying word communities. Pattern Recogn. 76, 691–703 (2018)
Wan, H., Ning, B., Tao, X., Long, J.: Research on Chinese short text clustering ensemble via convolutional neural networks. In: Liang, Q., Wang, W., Jiasong, Mu., Liu, X., Na, Z., Chen, B. (eds.) Artificial Intelligence in China. LNEE, vol. 572, pp. 622–628. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-0187-6_74
Hao, M., Xu, B., Liang, J.Y., Zhang, B.W., Yin, X.C.: Chinese short text classification with mutual-attention convolutional neural networks. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 19(5), 1–13 (2020)
Dai, D., et al.: An inception convolutional autoencoder model for Chinese healthcare question clustering. IEEE Trans. Cybern. 51(4), 2019–2031 (2019)
Geng, X., Zhang, Y., Jiao, Y., Mei, Y.: A novel hybrid clustering algorithm for topic detection on Chinese microblogging. IEEE Trans. Comput. Soc. Syst. 6(2), 289–300 (2019)
Chen, J., Gong, Z., Liu, W.: A Dirichlet process biterm-based mixture model for short text stream clustering. Appl. Intell. 50(5), 1609–1619 (2020). https://doi.org/10.1007/s10489-019-01606-1
Zamora, J., Mendoza, M., Allende, H.: Hashing-based clustering in high dimensional data. Expert Syst. Appl. 62, 202–211 (2016)
Cao, W., Yang, P., Ming, Z., Cai, S., Zhang, J.: An improved fuzziness based random vector functional link network for liver disease detection. In: 2020 IEEE 6th International Conference on Big Data Security on Cloud (BigDataSecurity), pp. 42–48, May 2020
Patwary, M.J., Cao, W., Wang, X.Z., Haque, M.A.: Fuzziness based semi-supervised multimodal learning for patient’s activity recognition using RGBDT videos. Appl. Soft Comput. 120, 108655 (2022)
Tang, G., et al.: A comparative study of neural network techniques for automatic software vulnerability detection. In: 2020 International Symposium on Theoretical Aspects of Software Engineering (TASE), pp. 1–8, December 2020
Acknowledgment
This work was supported by National Natural Science Foundation of China (Grant No. 62106150), CAAC Key Laboratory of Civil Aviation Wide Surveillance and Safety Operation Management and Control Technology (Grant No. 202102), and CCF-NSFOCUS (Grant No. 2021001).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, Y., Zhang, C., Ye, L., Peng, X., Qiu, M., Cao, W. (2022). W-Hash: A Novel Word Hash Clustering Algorithm for Large-Scale Chinese Short Text Analysis. In: Memmi, G., Yang, B., Kong, L., Zhang, T., Qiu, M. (eds) Knowledge Science, Engineering and Management. KSEM 2022. Lecture Notes in Computer Science(), vol 13370. Springer, Cham. https://doi.org/10.1007/978-3-031-10989-8_42
Download citation
DOI: https://doi.org/10.1007/978-3-031-10989-8_42
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10988-1
Online ISBN: 978-3-031-10989-8
eBook Packages: Computer ScienceComputer Science (R0)