research-article

Classifying Extremely Short Texts by Exploiting Semantic Centroids in Word Mover's Distance Space

Authors:

Ximing Li

WWW '19: The World Wide Web Conference

Pages 939 - 949

https://doi.org/10.1145/3308558.3313397

Published: 13 May 2019

Abstract

Automatically classifying extremely short texts, such as social media posts and web page titles, plays an important role in a wide range of content analysis applications. However, traditional classifiers based on bag-of-words (BoW) representations often fail at this task. The underlying reason is that document similarity cannot be measured accurately under BoW representations because short texts are extremely sparse, which makes it difficult to capture the generality of short texts. To address this problem, we use a better regularized word mover's distance (RWMD), which can measure distances among short texts at the semantic level. We then propose an RWMD-based centroid classifier for short texts, named RWMD-CC. RWMD-CC computes a representative semantic centroid for each category under the RWMD measure, and predicts the label of a test document by finding its closest semantic centroid. Testing is much more efficient than with the prior art, a k-nearest-neighbor classifier based on WMD. Experimental results indicate that RWMD-CC achieves very competitive classification performance on extremely short texts.
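The centroid idea in the abstract can be illustrated with a minimal sketch. This is not the authors' RWMD-CC: the abstract does not define the regularizer or how centroids are learned, so the sketch assumes an entropy-regularized transport cost computed by Sinkhorn iterations, and it assumes each class centroid is simply the pooled bag of words of that class's training texts. The embeddings in `EMB`, the toy training set, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-D word embeddings; a real system would use pretrained vectors.
EMB = {
    "goal": (0.0, 0.1), "match": (0.1, 0.0), "team": (0.2, 0.2),
    "stock": (5.0, 5.1), "market": (5.1, 5.0), "bank": (5.2, 5.2),
}

def histogram(tokens):
    """Return (words, weights): the normalized bag of words of a token list."""
    counts = Counter(tokens)
    words = sorted(counts)
    total = sum(counts.values())
    return words, [counts[w] / total for w in words]

def sinkhorn_cost(a, b, cost, reg=0.5, n_iter=100):
    """Transport cost <T, cost> of an entropy-regularized plan T between a and b."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals a and b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

def text_distance(tokens_a, tokens_b, emb=EMB):
    """Regularized transport distance between two token lists."""
    words_a, a = histogram(tokens_a)
    words_b, b = histogram(tokens_b)
    # Ground cost: Euclidean distance between word embeddings (Python 3.8+).
    cost = [[math.dist(emb[x], emb[y]) for y in words_b] for x in words_a]
    return sinkhorn_cost(a, b, cost)

def fit_centroids(train):
    """Assumption: a class centroid is the pooled tokens of its training texts."""
    centroids = {}
    for tokens, label in train:
        centroids.setdefault(label, []).extend(tokens)
    return centroids

def predict(tokens, centroids):
    """Assign the label whose centroid is closest under the regularized distance."""
    return min(centroids, key=lambda label: text_distance(tokens, centroids[label]))

train = [
    (["goal", "match"], "sports"), (["team", "match"], "sports"),
    (["stock", "market"], "finance"), (["bank", "market"], "finance"),
]
centroids = fit_centroids(train)
print(predict(["team", "goal"], centroids))
print(predict(["bank"], centroids))
```

With these toy vectors the two test texts are assigned to `sports` and `finance` respectively. Prediction needs one distance computation per class rather than one per training document, which is the source of the test-time saving over a k-nearest-neighbor classifier that the abstract mentions.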

Published In

WWW '19: The World Wide Web Conference

May 2019

3620 pages

ISBN: 9781450366748

DOI: 10.1145/3308558

Editors: Ling Liu (Georgia Tech, USA), Ryen White (Microsoft Research, USA)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '19: The Web Conference

May 13 - 17, 2019

San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 11
Total Downloads: 1,004
Downloads (Last 12 months): 25
Downloads (Last 6 weeks): 3

Reflects downloads up to 28 Feb 2025

Cited By

Stoltz D, Taylor M, Dudley J (2024). A Tool Kit for Relation Induction in Text Analysis. Sociological Methods & Research. Online publication date: 29-Feb-2024.
https://doi.org/10.1177/00491241241233242
El Arnaoty M, Servant F (2024). OneSpace: Detecting cross-language clones by learning a common embedding space. Journal of Systems and Software 208 (111911). Online publication date: Feb-2024.
https://doi.org/10.1016/j.jss.2023.111911
Guo F, Gan D, Li J (2024). Cell-to-cell distance that combines gene expression and gene embeddings. Computational and Structural Biotechnology Journal. Online publication date: Nov-2024.
https://doi.org/10.1016/j.csbj.2024.10.044
Li X, Guan Y, Fu B, Luo Z (2024). Anomaly-aware symmetric non-negative matrix factorization for short text clustering. Knowledge and Information Systems 67:2 (1481-1506). Online publication date: 4-Nov-2024.
https://doi.org/10.1007/s10115-024-02226-z
Yang S, Liu Y, Zhang Y, Zhu J (2022). A Word-Concept Heterogeneous Graph Convolutional Network for Short Text Classification. Neural Processing Letters 55:1 (735-750). Online publication date: 22-Jun-2022.
https://dl.acm.org/doi/10.1007/s11063-022-10906-6
Chi J, Wang B, Chen H, Zhang L, Li X, Ouyang J (2021). Approximate continuous optimal transport with copulas. International Journal of Intelligent Systems 37:8 (5354-5380). Online publication date: 30-Dec-2021.
https://dl.acm.org/doi/10.1002/int.22795
Cardoso A, Mourão F, Rocha L, de Salles Soares Neto C (2020). Mitigating Matching Scarcity in Recruitment Recommendation Domains. Proceedings of the Brazilian Symposium on Multimedia and the Web (165-172). Online publication date: 30-Nov-2020.
https://dl.acm.org/doi/10.1145/3428658.3430968
Subbanarasimha R, Srinivasa S, Mandyam S (2020). Invisible Stories That Drive Online Social Cognition. IEEE Transactions on Computational Social Systems 7:5 (1264-1277). Online publication date: Oct-2020.
https://doi.org/10.1109/TCSS.2020.3009474
Li C, Li X, Ouyang J, Wang Y (2020). Semantics-Assisted Wasserstein Learning for Topic and Word Embeddings. 2020 IEEE International Conference on Data Mining (ICDM) (292-301). Online publication date: Nov-2020.
https://doi.org/10.1109/ICDM50108.2020.00038
Gulle K, Ford N, Ebel P, Brokhausen F, Vogelsang A (2020). Topic Modeling on User Stories using Word Mover's Distance. 2020 IEEE Seventh International Workshop on Artificial Intelligence for Requirements Engineering (AIRE) (52-60). Online publication date: Sep-2020.
https://doi.org/10.1109/AIRE51212.2020.00015
