DOI: 10.1145/2740908.2742474
research-article

Short-Text Clustering using Statistical Semantics

Published: 18 May 2015

Abstract

Short documents are typically represented by very sparse vectors in the space of terms. In this case, traditional techniques for calculating text similarity produce measures that are very close to zero, since even very similar documents have few or no terms in common. To alleviate this limitation, the representation of short-text segments should be enriched by incorporating information about the correlation between terms. In other words, if two short segments have no words in common, but terms from the first segment frequently appear with terms from the second segment in other documents, then the segments are semantically related and their similarity measure should be high. Towards this goal, we employ a method for enhancing document clustering using statistical semantics. However, calculating the correlation between all pairs of terms is computationally expensive. In this work, we propose selecting a few terms and using them with the Nyström method to approximate the term-term correlation matrix. The terms for the Nyström method are selected by random sampling, with probabilities proportional to the lengths of their vectors in the document space. This allows more important terms to have more influence on the approximation of the term-term correlation matrix and accordingly achieves better accuracy.
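The sampling and approximation steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a term-document matrix `A` with one row per term, takes the "length" of a term vector to be its Euclidean norm, uses the uncentered product `A @ A.T` as the term-term correlation matrix, and samples without replacement. The function names, the toy data, and the `rcond` cutoff are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_terms(A, num_samples, rng):
    """Pick distinct term indices with probability proportional to row norms.

    Assumption: the "length" of a term vector is its Euclidean norm in
    document space, and at least num_samples terms have a non-zero norm.
    """
    norms = np.linalg.norm(A, axis=1)
    probs = norms / norms.sum()
    return rng.choice(A.shape[0], size=num_samples, replace=False, p=probs)


def nystrom_term_correlation(A, idx):
    """Return factors (C, W_pinv) with A @ A.T approximated by C @ W_pinv @ C.T.

    Only the columns of the correlation matrix that belong to the sampled
    terms are computed, so the full term-by-term matrix is never formed here.
    """
    A_s = A[idx]                 # sampled term vectors, shape (l, n_docs)
    C = A @ A_s.T                # every term against the sampled terms
    W = A_s @ A_s.T              # sampled terms against each other
    # rcond is an illustrative cutoff for near-zero singular values of W.
    return C, np.linalg.pinv(W, rcond=1e-10)


# Toy data (hypothetical): 50 terms, 8 documents.
rng = np.random.default_rng(0)
A = rng.random((50, 8))

idx = sample_terms(A, num_samples=10, rng=rng)
C, W_pinv = nystrom_term_correlation(A, idx)

# The full matrix is built below only to measure the approximation error.
G = A @ A.T
rel_err = np.linalg.norm(G - C @ W_pinv @ C.T) / np.linalg.norm(G)
print(f"sampled terms: {sorted(idx.tolist())}")
print(f"relative Frobenius error: {rel_err:.2e}")
```

In this toy setup the correlation matrix has rank at most 8, so sampling 10 terms recovers it almost exactly. On a realistic vocabulary the number of sampled terms would be far smaller than the rank, and the error would depend on which terms are drawn.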


Published In

WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web
May 2015
1602 pages
ISBN:9781450334730
DOI:10.1145/2740908

Sponsors

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. nystrom approximation
  2. short text clustering

Qualifiers

  • Research-article

Funding Sources

  • Qatar National Research Fund through National Priority Research Program (NPRP)

Conference

WWW '15
Sponsor:
  • IW3C2

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2024) Anomaly-aware symmetric non-negative matrix factorization for short text clustering. Knowledge and Information Systems 67:2 (1481-1506). DOI: 10.1007/s10115-024-02226-z. Online publication date: 4-Nov-2024
  • (2023) Evaluation and assessment of machine learning based user story grouping: A framework and empirical studies. Science of Computer Programming (102943). DOI: 10.1016/j.scico.2023.102943. Online publication date: Mar-2023
  • (2023) User story clustering in agile development: a framework and an empirical study. Frontiers of Computer Science 17:6. DOI: 10.1007/s11704-022-8262-9. Online publication date: 21-Jan-2023
  • (2022) Minute-Paper Dashboard: Identification of Learner’s Misconceptions Using Topic Modeling on Formative Reflections. 2022 IEEE Frontiers in Education Conference (FIE) (1-5). DOI: 10.1109/FIE56618.2022.9962598. Online publication date: 8-Oct-2022
  • (2021) Topic Modeling for Customer Service Chats. 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS) (1-6). DOI: 10.1109/ICACSIS53237.2021.9631322. Online publication date: 23-Oct-2021
  • (2021) A Novel Text Ensemble Clustering Based on Weighted Entropy Filtering Model. Journal of Physics: Conference Series 2024:1 (012045). DOI: 10.1088/1742-6596/2024/1/012045. Online publication date: 1-Sep-2021
  • (2020) Confronting Sparseness and High Dimensionality in Short Text Clustering via Feature Vector Projections. 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI) (813-820). DOI: 10.1109/ICTAI50040.2020.00129. Online publication date: Nov-2020
  • (2019) Semantic Sparse Service Discovery Using Word Embedding and Gaussian LDA. IEEE Access 7 (88231-88242). DOI: 10.1109/ACCESS.2019.2926559. Online publication date: 2019
  • (2019) Taxonomy-Augmented Features for Document Clustering. Data Mining (241-252). DOI: 10.1007/978-981-13-6661-1_19. Online publication date: 16-Feb-2019
  • (2019) Survey on Social Networks Data Analysis. Innovations for Community Services (100-119). DOI: 10.1007/978-3-030-37484-6_6. Online publication date: 15-Dec-2019
