DOI: 10.1145/3308558.3313397
research-article

Classifying Extremely Short Texts by Exploiting Semantic Centroids in Word Mover's Distance Space

Published: 13 May 2019

Abstract

Automatically classifying extremely short texts, such as social media posts and web page titles, plays an important role in a wide range of content analysis applications. However, traditional classifiers based on bag-of-words (BoW) representations often fail at this task. The underlying reason is that document similarity cannot be measured accurately under BoW representations because short texts are extremely sparse, which makes it difficult to capture the generality of short texts. To address this problem, we use a better regularized word mover's distance (RWMD), which can measure distances among short texts at the semantic level. We then propose an RWMD-based centroid classifier for short texts, named RWMD-CC. RWMD-CC computes a representative semantic centroid for each category under the RWMD measure and classifies a test document by finding its closest semantic centroid. Testing is much more efficient than with the prior art, a k-nearest-neighbor classifier based on WMD. Experimental results indicate that RWMD-CC achieves very competitive classification performance on extremely short texts.
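The prediction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the regularization is an entropy-regularized (Sinkhorn-style) optimal transport cost, it assumes each centroid is already given as a small weighted set of word vectors (the paper learns its centroids, which is not shown here), and the function names, the toy 2-D embeddings, and the `reg` value are all hypothetical.

```python
import numpy as np

def pairwise_cost(x, y):
    """Euclidean distance between every row of x and every row of y."""
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

def regularized_wmd(a, x, b, y, reg=0.1, n_iter=200):
    """Entropy-regularized transport cost between two weighted word sets.

    a, b: word weights (each sums to 1); x, y: matching word vectors.
    Assumption: Sinkhorn-style scaling stands in for the paper's RWMD.
    """
    cost = pairwise_cost(x, y)
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)  # rescale to match column marginals
        u = a / (kernel @ v)    # rescale to match row marginals
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost))

def predict(doc, centroids, reg=0.1):
    """Return the label of the centroid closest to the document."""
    a, x = doc
    dists = {
        label: regularized_wmd(a, x, b, y, reg)
        for label, (b, y) in centroids.items()
    }
    return min(dists, key=dists.get)

# Toy 2-D "embeddings" (hypothetical): one centroid per category.
centroids = {
    "sports": (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [0.1, 0.0]])),
    "politics": (np.array([0.5, 0.5]), np.array([[1.0, 1.0], [1.0, 0.9]])),
}
doc_w = np.array([0.5, 0.5])
doc_x = np.array([[0.05, 0.0], [0.0, 0.1]])
label = predict((doc_w, doc_x), centroids)
print(label)
```

With one centroid per category, classifying a document costs one distance computation per category rather than one per training document, which is the source of the test-time saving over a k-nearest-neighbor classifier.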

Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN:9781450366748
DOI:10.1145/3308558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Extremely Short Texts
  2. Hypothesis Margin
  3. Regularized Word Mover's Distance
  4. Semantic Centroid

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2024) A Tool Kit for Relation Induction in Text Analysis. Sociological Methods & Research. DOI: 10.1177/00491241241233242. Online publication date: 29-Feb-2024
  • (2024) OneSpace: Detecting cross-language clones by learning a common embedding space. Journal of Systems and Software 208 (111911). DOI: 10.1016/j.jss.2023.111911. Online publication date: Feb-2024
  • (2024) Cell-to-cell distance that combines gene expression and gene embeddings. Computational and Structural Biotechnology Journal. DOI: 10.1016/j.csbj.2024.10.044. Online publication date: Nov-2024
  • (2024) Anomaly-aware symmetric non-negative matrix factorization for short text clustering. Knowledge and Information Systems 67:2 (1481-1506). DOI: 10.1007/s10115-024-02226-z. Online publication date: 4-Nov-2024
  • (2022) A Word-Concept Heterogeneous Graph Convolutional Network for Short Text Classification. Neural Processing Letters 55:1 (735-750). DOI: 10.1007/s11063-022-10906-6. Online publication date: 22-Jun-2022
  • (2021) Approximate continuous optimal transport with copulas. International Journal of Intelligent Systems 37:8 (5354-5380). DOI: 10.1002/int.22795. Online publication date: 30-Dec-2021
  • (2020) Mitigating Matching Scarcity in Recruitment Recommendation Domains. Proceedings of the Brazilian Symposium on Multimedia and the Web (165-172). DOI: 10.1145/3428658.3430968. Online publication date: 30-Nov-2020
  • (2020) Invisible Stories That Drive Online Social Cognition. IEEE Transactions on Computational Social Systems 7:5 (1264-1277). DOI: 10.1109/TCSS.2020.3009474. Online publication date: Oct-2020
  • (2020) Semantics-Assisted Wasserstein Learning for Topic and Word Embeddings. 2020 IEEE International Conference on Data Mining (ICDM) (292-301). DOI: 10.1109/ICDM50108.2020.00038. Online publication date: Nov-2020
  • (2020) Topic Modeling on User Stories using Word Mover's Distance. 2020 IEEE Seventh International Workshop on Artificial Intelligence for Requirements Engineering (AIRE) (52-60). DOI: 10.1109/AIRE51212.2020.00015. Online publication date: Sep-2020
