W-Hash: A Novel Word Hash Clustering Algorithm for Large-Scale Chinese Short Text Analysis

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13370)

Abstract

Short text clustering is an unsupervised learning technique for pattern discovery and analysis of short text datasets, and it has been applied in many scenarios such as business risk control and audit. With the development of digitalization over the last few years, the data scale in these scenarios has increased rapidly. Traditional short text clustering methods such as K-means face many challenges in large-scale data analysis, including hyperparameters that are difficult to preset and high computational complexity. To alleviate these problems, we propose a novel clustering algorithm called the Word Hash clustering algorithm (W-Hash) for Chinese short text analysis. Specifically, W-Hash does not require a pre-specified number of clusters, and it has much lower computational complexity than traditional clustering approaches. To verify the effectiveness of W-Hash, we apply it to a real-life business audit problem. The experimental results show that W-Hash outperforms traditional clustering algorithms in both training time and the rationality of the results.
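The abstract does not describe how W-Hash works internally, so the following is not the authors' algorithm. It is only a generic sketch of why grouping by a hashable word key needs no preset cluster count and finishes in a single pass over the data. The function name `hash_bucket_cluster`, the `key_size` parameter, the key construction, and the sample texts are all hypothetical, and the input is assumed to be already segmented into space-separated words.

```python
from collections import defaultdict


def hash_bucket_cluster(texts, stopwords=frozenset(), key_size=2):
    """Group pre-segmented short texts by a hashable word key.

    Illustrative only: this is a generic hash-bucket grouping,
    not the W-Hash algorithm from the paper.
    """
    buckets = defaultdict(list)
    for text in texts:
        # Assumption: words are already segmented and separated by spaces.
        words = [w for w in text.split() if w not in stopwords]
        # Hypothetical key: the first `key_size` distinct words in sorted order,
        # so word order inside a text does not change its bucket.
        key = tuple(sorted(set(words))[:key_size])
        # One dictionary insert per text; no distance computations.
        buckets[key].append(text)
    return dict(buckets)


# Made-up sample records (invoice / amount / anomaly; contract / payment / delay).
texts = [
    "发票 金额 异常",
    "金额 发票 异常",
    "合同 付款 延迟",
]
clusters = hash_bucket_cluster(texts)
```

In this sketch the number of groups is simply the number of distinct keys observed, which is the property the abstract contrasts with K-means, where the cluster count must be chosen in advance.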

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 62106150), the CAAC Key Laboratory of Civil Aviation Wide Surveillance and Safety Operation Management and Control Technology (Grant No. 202102), and CCF-NSFOCUS (Grant No. 2021001).

Author information

Corresponding author

Correspondence to Weipeng Cao.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, Y., Zhang, C., Ye, L., Peng, X., Qiu, M., Cao, W. (2022). W-Hash: A Novel Word Hash Clustering Algorithm for Large-Scale Chinese Short Text Analysis. In: Memmi, G., Yang, B., Kong, L., Zhang, T., Qiu, M. (eds) Knowledge Science, Engineering and Management. KSEM 2022. Lecture Notes in Computer Science, vol 13370. Springer, Cham. https://doi.org/10.1007/978-3-031-10989-8_42

  • DOI: https://doi.org/10.1007/978-3-031-10989-8_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-10988-1

  • Online ISBN: 978-3-031-10989-8

  • eBook Packages: Computer Science, Computer Science (R0)
