News Classification from Social Media Using Twitter-based Doc2Vec Model and Automatic Query Expansion

Authors:

Minh-Triet Tran

SoICT '17: Proceedings of the 8th International Symposium on Information and Communication Technology

Pages 460 - 467

https://doi.org/10.1145/3155133.3155206

Published: 07 December 2017

Abstract

News classification is an essential need for people to organize, better understand, and utilize information from the Internet. This motivates us to propose a novel method to classify news from social media. First, we vectorize an article with TD2V, our pre-trained Twitter-based universal document representation that follows the Doc2Vec approach. We then define Modified Distance to better measure the semantic distance between two document vectors. Finally, we apply retrieval and automatic query expansion to get the most relevant labeled documents in a corpus and use them to determine the category of a new article. Because TD2V is created from 297 million words in 420,351 news articles from more than one million tweets on Twitter from 2010 to 2017, it can be used as an efficient pre-trained model for English document representation in various applications. Experiments on datasets from different online sources show that our method achieves better classification accuracy than existing methods, specifically 98.4±0.3% (BBC dataset), 98.9±0.7% (BBC Sport dataset), 94.1±0.2% (Amazon4 dataset), and 78.6% (20NewsGroup dataset). Furthermore, in the classification training process, we only encode all articles in the training set with TD2V; we do not train a dedicated classification model for each of these datasets.
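The retrieve, expand, and vote pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify Modified Distance or the exact expansion scheme, so this sketch makes several assumptions: plain cosine distance stands in for Modified Distance, expansion averages the query with its first-pass neighbours, the label is chosen by majority vote, and the three-dimensional toy vectors stand in for real TD2V embeddings. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Cosine distance; an assumed stand-in for the paper's Modified Distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_k(query, corpus, k):
    """Return the k labeled documents nearest to the query vector."""
    return sorted(corpus, key=lambda item: cosine_distance(query, item[0]))[:k]


def expand_query(query, neighbours):
    """Assumed expansion scheme: average the query with its retrieved neighbours."""
    vectors = [query] + [vec for vec, _ in neighbours]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(query, corpus, k=3):
    """Retrieve, expand the query, retrieve again, then take a majority vote."""
    first_pass = top_k(query, corpus, k)
    expanded = expand_query(query, first_pass)
    second_pass = top_k(expanded, corpus, k)
    votes = Counter(label for _, label in second_pass)
    return votes.most_common(1)[0][0]


# Toy labeled corpus: (document vector, category). Real vectors would come
# from a document-embedding model; these values are made up for illustration.
corpus = [
    ([0.9, 0.1, 0.0], "sport"),
    ([0.8, 0.2, 0.1], "sport"),
    ([0.7, 0.3, 0.0], "sport"),
    ([0.1, 0.9, 0.2], "business"),
    ([0.0, 0.8, 0.3], "business"),
    ([0.2, 0.7, 0.1], "business"),
]

predicted = classify([0.85, 0.15, 0.05], corpus, k=3)
print(predicted)
```

Note that no classifier is trained in this sketch: the labeled corpus is only encoded and searched, which mirrors the abstract's remark that the training set is encoded rather than used to fit a dedicated model.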

Index Terms

News Classification from Social Media Using Twitter-based Doc2Vec Model and Automatic Query Expansion
1. Information systems
  1. Information retrieval
  2. World Wide Web
    1. Web searching and information discovery
      1. Content ranking

Published In

SoICT '17: Proceedings of the 8th International Symposium on Information and Communication Technology

December 2017

486 pages

ISBN:9781450353281

DOI:10.1145/3155133

General Chairs:
Huynh Quyet Thang (HUST, Vietnam)
Zhenjiang Hu (NII, Japan)

Program Chairs:
Marc Bui (EPHE, France)
Biplab Sikdar (NUS, Singapore)
Ichiro IDE (Nagoya, Japan)
Huynh Thi Thanh Binh (HUST, Vietnam)

Publications Chairs:
Worrawat Engchuan (Canada)
Dinh Viet Sang (HUST, Vietnam)
Nguyen Thi Oanh (HUST, Vietnam)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

SOICT: School of Information and Communication Technology - HUST
NAFOSTED: The National Foundation for Science and Technology Development

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SoICT 2017

SoICT 2017: The Eighth International Symposium on Information and Communication Technology

December 7 - 8, 2017

Nha Trang City, Viet Nam

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%

Bibliometrics & Citations

Article Metrics

Total Citations: 28
Total Downloads: 726
Downloads (Last 12 months): 27
Downloads (Last 6 weeks): 6

Reflects downloads up to 19 Feb 2025

Citations

Cited By

YÜREKLİ A (2023). ON THE EFFECTIVENESS OF PARAGRAPH VECTOR MODELS IN DOCUMENT SIMILARITY ESTIMATION FOR TURKISH NEWS CATEGORIZATION. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering 24:1 (23-34). Online publication date: 29-Mar-2023
https://doi.org/10.18038/estubtda.1175001
Shehnepoor S, Togneri R, Liu W, Bennamoun M (2023). HIN-RNN: A Graph Representation Learning Neural Network for Fraudster Group Detection With No Handcrafted Features. IEEE Transactions on Neural Networks and Learning Systems 34:8 (4153-4166). Online publication date: Aug-2023
https://doi.org/10.1109/TNNLS.2021.3123876
Liu C, Wang X, Xu H (2022). Text Classification Using Document-Relational Graph Convolutional Networks. IEEE Access 10 (123205-123211). Online publication date: 2022
https://doi.org/10.1109/ACCESS.2022.3221820
Potts C, Savaliya A, Jhala A (2022). Leveraging Multiple Representations of Topic Models for Knowledge Discovery. IEEE Access 10 (104696-104705). Online publication date: 2022
https://doi.org/10.1109/ACCESS.2022.3210529
Chaudhuri N, Dhar D, Yammiyavar P (2022). Automating assessment of design exams. Expert Systems with Applications: An International Journal 189:C. Online publication date: 1-Mar-2022
https://dl.acm.org/doi/10.1016/j.eswa.2021.116108
Tran M, Trieu L, Tran H (2022). Document representation and classification with Twitter-based document embedding, adversarial domain-adaptation, and query expansion. Journal of Heuristics 28:2 (211-233). Online publication date: 1-Apr-2022
https://dl.acm.org/doi/10.1007/s10732-019-09417-w
Chen L (2022). Predicting the Usefulness of Questions in Q&A Communities: A Comparison of Classical Machine Learning and Deep Learning Approaches. HCI in Business, Government and Organizations (153-162). Online publication date: 16-Jun-2022
https://doi.org/10.1007/978-3-031-05544-7_12
Svetlov K, Legostaeva N (2021). Digital Transformation in the Russian Federation: Thematic Landscape of Online Communities. 2021 30th Conference of Open Innovations Association FRUCT (285-291). Online publication date: 27-Oct-2021
https://doi.org/10.23919/FRUCT53335.2021.9599983
Kurihara K, Shoji Y, Fujita S, Dürst M (2021). Doc2Vec-based Approach for Extracting Diverse Evaluation Expressions from Online Review Data. The 23rd International Conference on Information Integration and Web Intelligence (11-18). Online publication date: 29-Nov-2021
https://dl.acm.org/doi/10.1145/3487664.3487773
Kulyadi S, Mohandas P, Kumar S, Raman M, Vasan V (2021). Anomaly Detection using Generative Adversarial Networks on Firewall Log Message Data. 2021 13th International Conference on Electronics, Computers and Artificial Intelligence (ECAI) (1-6). Online publication date: 1-Jul-2021
https://doi.org/10.1109/ECAI52376.2021.9515086
