DOI: 10.1145/3055635.3056654
research-article

Soft Short-Text Clustering using PageRank as a Centrality Measure

Published: 24 February 2017

Abstract

Whereas a hard clustering algorithm assigns each pattern to a single cluster class, soft clustering allows a pattern to belong to all cluster classes with different degrees of membership. This is important in short-text clustering, in which a small text fragment such as a quotation or sentence may be related to more than one subject or topic. However, soft clustering of short texts is complicated by the computational difficulty of defining cluster centroids using conventional cluster centrality measures. This paper therefore proposes a new soft short-text clustering algorithm that uses PageRank as a centrality measure. Results suggest that, when used in hard clustering mode, its performance is on par with state-of-the-art spectral clustering algorithms. Advantages of the algorithm include its ability to perform soft clustering, to operate on non-symmetric matrices, and to converge automatically to an appropriate number of cluster classes.
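To make the idea of PageRank as a centrality measure concrete, the following is a minimal sketch of power-iteration PageRank over a pairwise similarity matrix of text fragments. It is not the algorithm proposed in the paper, whose details are not given in the abstract. The function name `pagerank_centrality`, the damping value of 0.85, and the toy similarity matrix are all assumptions made for illustration.

```python
import numpy as np


def pagerank_centrality(sim, damping=0.85, tol=1e-10, max_iter=1000):
    """Score each fragment by PageRank over a non-negative similarity matrix.

    sim[i, j] is the weight of the edge from fragment i to fragment j.
    The matrix does not have to be symmetric.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]

    # Row-normalise so each row is a probability distribution over neighbours.
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Rows with no outgoing weight jump uniformly to every fragment.
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)

    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1.0 - damping) / n + damping * (transition.T @ scores)
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    return scores / scores.sum()


# Hypothetical toy data: fragment 2 is similar to every other fragment.
sim = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
scores = pagerank_centrality(sim)
print(scores)
```

In this toy graph the fragment connected to all others receives the highest score, which is the sense in which PageRank can stand in for a centroid when no vector-space mean is available.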


Published In

ICMLC '17: Proceedings of the 9th International Conference on Machine Learning and Computing
February 2017
545 pages
ISBN:9781450348171
DOI:10.1145/3055635

In-Cooperation

  • Southwest Jiaotong University

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Centrality Measurement
  2. PageRank
  3. Soft Short-Text Clustering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMLC 2017
