Abstract
The past few decades have seen growth in both the volume of textual data and the influence of the news media. As more text from news outlets becomes available, it is increasingly important to categorise news topics quickly. The primary aim of this research is to suggest a method for automatically identifying the topics of news articles. The dataset used was the news category dataset published on Kaggle, comprising 210,294 headlines and abstracts from HuffPost between 2012 and 2022. It contained 42 categories and six columns. Traditional modelling techniques did not perform as well as Top2Vec, NMF or BERTopic. This research confirms the efficacy of Top2Vec and BERTopic, followed by NMF, LDA and LSA, for analysing news category data from a human-interpretation perspective. Although BERTopic was able to deduce 1145 topics from the data, it could not discard unwanted words such as “to”, “say” and “for”, which add no value to the topic semantics. In summary, TF-IDF proved to be the best feature extraction technique and Top2Vec the best topic modelling technique.
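As a rough illustration of the TF-IDF feature extraction named in the abstract, the sketch below weights terms in three toy headlines using only the Python standard library. The headlines, the `tfidf` helper and the smoothed IDF formula are assumptions made for illustration; the chapter does not state which TF-IDF variant or tooling the authors used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term frequency and the smoothed IDF log((1 + N) / (1 + df)) + 1.
    This weighting is an assumption, not taken from the chapter.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


# Toy headlines, invented for illustration only.
headlines = [
    "senate to vote on budget",
    "team to play final",
    "court to rule on budget case",
]
weights = tfidf(headlines)

# Print the terms of the first headline, highest weight first.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In this sketch a word such as "to", which occurs in every headline, receives the lowest weight, while a word unique to one headline receives the highest. The weighting only down-weights common words; it does not remove them, so a separate stop-word filter would still be needed to drop them entirely.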
Acknowledgements
The authors would like to thank UNITAR International University for supporting the publication of this paper.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Rajan, S.D., Coombs, T., Jayabalan, M., Ismail, N.A. (2024). A Comparative Study of Methods for Topic Modelling in News Articles. In: Bee Wah, Y., Al-Jumeily OBE, D., Berry, M.W. (eds) Data Science and Emerging Technologies. DaSET 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 191. Springer, Singapore. https://doi.org/10.1007/978-981-97-0293-0_20
DOI: https://doi.org/10.1007/978-981-97-0293-0_20
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0292-3
Online ISBN: 978-981-97-0293-0
eBook Packages: Intelligent Technologies and Robotics (R0)