W-Hash: A Novel Word Hash Clustering Algorithm for Large-Scale Chinese Short Text Analysis

Chen, Yaofeng; Zhang, Chunyang; Ye, Long; Peng, Xiaogang; Qiu, Meikang; Cao, Weipeng

doi:10.1007/978-3-031-10989-8_42

Conference paper
First Online: 19 July 2022

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13370)

Abstract

Short text clustering is an unsupervised learning technique for pattern discovery and analysis of short text datasets, and it has been applied to many scenarios such as business risk control and audit. With the development of digitalization over the last few years, the data scale in various scenarios has increased rapidly. Traditional short text clustering methods such as K-means face many challenges in large-scale data analysis, such as hyperparameters that are difficult to preset and high computational complexity. To alleviate these problems, we propose a novel clustering algorithm called the Word Hash clustering algorithm (W-Hash) for Chinese short text analysis. Specifically, W-Hash does not require a pre-specified number of clusters, and it has much lower computational complexity than traditional clustering approaches. To verify the effectiveness of W-Hash, we apply it to a real-life business audit problem. The experimental results show that W-Hash outperforms traditional clustering algorithms in both training time and the rationality of the results.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 62106150), the CAAC Key Laboratory of Civil Aviation Wide Surveillance and Safety Operation Management and Control Technology (Grant No. 202102), and CCF-NSFOCUS (Grant No. 2021001).

Author information

Authors and Affiliations

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China
Yaofeng Chen, Long Ye, Xiaogang Peng & Weipeng Cao
Information Systems, Technical University of Munich, Munich, Germany
Chunyang Zhang
Department of Computer Science, Texas A&M University-Commerce, Commerce, TX, 75428, USA
Meikang Qiu

Corresponding author

Correspondence to Weipeng Cao.

Editor information

Editors and Affiliations

Télécom Paris, Paris, France
Gerard Memmi
Purdue University, West Lafayette, IN, USA
Baijian Yang
Shanghai Jiao Tong University, Shanghai, Shanghai, China
Linghe Kong
Nanyang Technological University, Singapore, Singapore
Tianwei Zhang
Texas A&M University – Commerce, Commerce, TX, USA
Meikang Qiu

About this paper

Cite this paper

Chen, Y., Zhang, C., Ye, L., Peng, X., Qiu, M., Cao, W. (2022). W-Hash: A Novel Word Hash Clustering Algorithm for Large-Scale Chinese Short Text Analysis. In: Memmi, G., Yang, B., Kong, L., Zhang, T., Qiu, M. (eds) Knowledge Science, Engineering and Management. KSEM 2022. Lecture Notes in Computer Science, vol 13370. Springer, Cham. https://doi.org/10.1007/978-3-031-10989-8_42

DOI: https://doi.org/10.1007/978-3-031-10989-8_42
Published: 19 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10988-1
Online ISBN: 978-3-031-10989-8
eBook Packages: Computer Science, Computer Science (R0)
