Enriching short text representation in microblog for clustering

Tang, Jiliang; Wang, Xufei; Gao, Huiji; Hu, Xia; Liu, Huan

doi:10.1007/s11704-011-1167-7

Research Article
Published: 27 January 2012

Frontiers of Computer Science, Volume 6, pages 88–101 (2012)

Abstract

Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbreviations, and coined acronyms and words exacerbate the problems of synonymy and polysemy, and bring about new challenges to data mining applications such as text clustering and classification. To address these issues, we dissect some potential causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. Then we propose a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. With its significant performance improvement, we further investigate potential factors that contribute to the improved performance.
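The abstract does not spell out the factorization, so the following is only a minimal sketch of the general idea, not the authors' framework. It assumes that each message has a term-count vector in the original language plus one from a machine-translated version, concatenates the two, and applies plain non-negative matrix factorization with the standard multiplicative updates to obtain a low-dimensional document representation. The function names (`matmul`, `transpose`, `nmf`) and the toy matrices are hypothetical, and the "translated" counts are invented rather than produced by a translation system.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(x, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix x (documents x features) as w times h.

    Uses multiplicative updates for the squared-error objective; this is a
    generic stand-in, assumed here because the abstract gives no formula.
    """
    rng = random.Random(seed)
    n_docs, n_feats = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_feats)] for _ in range(rank)]
    for _ in range(iterations):
        # h <- h * (w^T x) / (w^T w h), element-wise
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [[hv * nv / (dv + eps) for hv, nv, dv in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        # w <- w * (x h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[wv * nv / (dv + eps) for wv, nv, dv in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, num, den)]
    return w, h


# Invented toy data: 4 short messages, 4 original-language terms,
# and 2 terms from a hypothetical translated version of each message.
original = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
translated = [[1, 0], [1, 0], [0, 1], [0, 1]]

# Enrich each message by appending its translated-language features.
enriched = [o + t for o, t in zip(original, translated)]

# Reduce the enriched features to 2 latent dimensions.
doc_factors, term_factors = nmf(enriched, rank=2)

# Assign each message to the latent dimension with the largest weight.
clusters = [max(range(2), key=lambda k: row[k]) for row in doc_factors]
print(clusters)
```

In this sketch the rows of `doc_factors` play the role of the reduced representation, and taking the largest entry per row is one simple way to read off cluster labels. A real pipeline would replace the invented matrices with weighted term vectors built from the actual messages and their translations.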

Author information

Authors and Affiliations

Computer Science & Engineering, Arizona State University, Tempe, AZ, 85281, USA
Jiliang Tang, Xufei Wang, Huiji Gao, Xia Hu & Huan Liu

Corresponding author

Correspondence to Jiliang Tang.

Additional information

Jiliang Tang is a PhD student in computer science and engineering at Arizona State University. He received his BSc and MSc degrees from Beijing Institute of Technology in 2008 and 2010. His research interests include data mining and machine learning. Specifically, he is interested in social computing and feature selection in social media.

Xufei Wang is a PhD student in computer science and engineering at Arizona State University. He received his Master's degree from Tsinghua University and his Bachelor of Science degree from Zhejiang University, China. His research interests are in social computing and data mining. Specifically, he is interested in mining social media data, social network analysis, ego-centric friend structures, tag networks, and crowdsourcing. He is an IEEE student member.

Huiji Gao is a PhD student in Data Mining and Machine Learning (DMML) Lab at Arizona State University (ASU). He received his BSc and MSc degrees from Beijing University of Posts and Telecommunications, China in 2007 and 2010. His research interests include social computing, data mining, and social media mining, in particular, crowdsourcing and spatial-temporal mining. Contact him at huiji.gao@asu.edu.

Xia Hu is a PhD student in Computer Science and Engineering at Arizona State University. He received his Master's and Bachelor's degrees from the School of Computer Science and Engineering of Beihang University. His research interests include text analytics in social media, social network analysis, machine learning, text representation, and sentiment analysis. He was awarded an ASU GPSA Travel Grant, a Machine Learning Summer School at Purdue Fellowship, an SDM Doctoral Student Forum Fellowship, and various Student Travel Awards and Scholarships from ASU, NUS, and BUAA.

Dr. Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He obtained his PhD degree in Computer Science at the University of Southern California and his BEng degree in Computer Science and Electrical Engineering at Shanghai Jiao Tong University. His research focuses on problems that arise in many real-world applications with high-dimensional data of disparate forms, such as social media analysis, group interaction and modeling, feature selection, and text/web mining. His well-cited publications include books, book chapters, and encyclopedia entries, as well as conference and journal papers.

About this article

Cite this article

Tang, J., Wang, X., Gao, H. et al. Enriching short text representation in microblog for clustering. Front. Comput. Sci. 6, 88–101 (2012). https://doi.org/10.1007/s11704-011-1167-7

Received: 30 September 2011
Accepted: 07 November 2011
Published: 27 January 2012
Issue Date: February 2012
DOI: https://doi.org/10.1007/s11704-011-1167-7
