DOI: 10.1145/3155133.3155206
research-article

News Classification from Social Media Using Twitter-based Doc2Vec Model and Automatic Query Expansion

Published: 07 December 2017

Abstract

News classification is an essential need for people who want to organize, better understand, and use information from the Internet. This motivates us to propose a novel method for classifying news from social media. First, we vectorize an article with TD2V, our pre-trained, Twitter-based universal document representation that follows the Doc2Vec approach. We then define a Modified Distance to better measure the semantic distance between two document vectors. Finally, we apply retrieval and automatic query expansion to find the most relevant labeled documents in a corpus and use them to determine the category of a new article. Because TD2V is built from 297 million words in 420,351 news articles from more than one million tweets on Twitter from 2010 to 2017, it can serve as an efficient pre-trained model for English document representation in various applications. Experiments on datasets from different online sources show that our method achieves higher classification accuracy than existing methods: 98.4±0.3% (BBC dataset), 98.9±0.7% (BBC Sport dataset), 94.1±0.2% (Amazon4 dataset), and 78.6% (20NewsGroup dataset). Furthermore, the classification training process only encodes all articles in the training set with TD2V; it does not train a dedicated classification model for each of these datasets.
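
The retrieve, expand, and vote pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define the Modified Distance, so plain cosine distance is swapped in. The expansion step assumes a common "average query expansion" form, in which the query vector is averaged with its retrieved neighbours. The function names, the value of `k`, and the toy 2-D vectors standing in for TD2V embeddings are all hypothetical.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """1 - cosine similarity. A stand-in, NOT the paper's Modified Distance.

    Assumes both vectors are non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def retrieve(query, corpus, k):
    """Return the k labeled documents (vector, label) closest to the query."""
    return sorted(corpus, key=lambda doc: cosine_distance(query, doc[0]))[:k]


def expand_query(query, neighbours):
    """Average the query with its retrieved neighbours (assumed expansion form)."""
    vectors = [query] + [vec for vec, _ in neighbours]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def classify(query, corpus, k=3):
    """Retrieve, expand the query once, retrieve again, then majority-vote."""
    first_pass = retrieve(query, corpus, k)
    expanded = expand_query(query, first_pass)
    second_pass = retrieve(expanded, corpus, k)
    votes = Counter(label for _, label in second_pass)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 2-D vectors standing in for TD2V document embeddings.
corpus = [
    ([1.0, 0.1], "sport"),
    ([0.9, 0.2], "sport"),
    ([0.8, 0.3], "sport"),
    ([0.1, 1.0], "tech"),
    ([0.2, 0.9], "tech"),
    ([0.3, 0.8], "tech"),
]

print(classify([0.7, 0.4], corpus))  # prints "sport"
```

As in the abstract, the only "training" here is storing the encoded labeled documents; no dataset-specific classifier is fitted.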

Published In

SoICT '17: Proceedings of the 8th International Symposium on Information and Communication Technology
December 2017
486 pages
ISBN: 9781450353281
DOI: 10.1145/3155133
In-Cooperation

  • SOICT: School of Information and Communication Technology - HUST
  • NAFOSTED: The National Foundation for Science and Technology Development

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Doc2Vec
  2. Twitter
  3. automatic query expansion
  4. document embedding
  5. news classification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoICT 2017

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%

Cited By

  • (2023) ON THE EFFECTIVENESS OF PARAGRAPH VECTOR MODELS IN DOCUMENT SIMILARITY ESTIMATION FOR TURKISH NEWS CATEGORIZATION. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering 24:1 (23-34). DOI: 10.18038/estubtda.1175001. Online publication date: 29-Mar-2023
  • (2023) HIN-RNN: A Graph Representation Learning Neural Network for Fraudster Group Detection With No Handcrafted Features. IEEE Transactions on Neural Networks and Learning Systems 34:8 (4153-4166). DOI: 10.1109/TNNLS.2021.3123876. Online publication date: Aug-2023
  • (2022) Text Classification Using Document-Relational Graph Convolutional Networks. IEEE Access 10 (123205-123211). DOI: 10.1109/ACCESS.2022.3221820. Online publication date: 2022
  • (2022) Leveraging Multiple Representations of Topic Models for Knowledge Discovery. IEEE Access 10 (104696-104705). DOI: 10.1109/ACCESS.2022.3210529. Online publication date: 2022
  • (2022) Automating assessment of design exams. Expert Systems with Applications: An International Journal 189:C. DOI: 10.1016/j.eswa.2021.116108. Online publication date: 1-Mar-2022
  • (2022) Document representation and classification with Twitter-based document embedding, adversarial domain-adaptation, and query expansion. Journal of Heuristics 28:2 (211-233). DOI: 10.1007/s10732-019-09417-w. Online publication date: 1-Apr-2022
  • (2022) Predicting the Usefulness of Questions in Q&A Communities: A Comparison of Classical Machine Learning and Deep Learning Approaches. HCI in Business, Government and Organizations (153-162). DOI: 10.1007/978-3-031-05544-7_12. Online publication date: 16-Jun-2022
  • (2021) Digital Transformation in the Russian Federation: Thematic Landscape of Online Communities. 2021 30th Conference of Open Innovations Association FRUCT (285-291). DOI: 10.23919/FRUCT53335.2021.9599983. Online publication date: 27-Oct-2021
  • (2021) Doc2Vec-based Approach for Extracting Diverse Evaluation Expressions from Online Review Data. The 23rd International Conference on Information Integration and Web Intelligence (11-18). DOI: 10.1145/3487664.3487773. Online publication date: 29-Nov-2021
  • (2021) Anomaly Detection using Generative Adversarial Networks on Firewall Log Message Data. 2021 13th International Conference on Electronics, Computers and Artificial Intelligence (ECAI) (1-6). DOI: 10.1109/ECAI52376.2021.9515086. Online publication date: 1-Jul-2021
