A Comparative Study of Methods for Topic Modelling in News Articles

Rajan, Swapna D.; Coombs, Thomas; Jayabalan, Manoj; Ismail, Noor Azma

doi:10.1007/978-981-97-0293-0_20

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 191)

Included in the following conference series:

The International Conference on Data Science and Emerging Technologies

Abstract

The past few decades have seen an increase in both the volume of textual data and the influence of the news media. With this rise in available data, especially textual data from news media, it is imperative to categorise news topics quickly. The primary aim of this research is to suggest a method for automatically identifying news topics in articles. The dataset used was the News Category dataset published on Kaggle, which comprises 210,294 headlines and abstracts from HuffPost between 2012 and 2022. The dataset consists of 42 categories and six columns. Traditional modelling techniques did not perform well in comparison with Top2Vec, NMF or BERTopic. This research confirms the efficacy of Top2Vec and BERTopic, followed by NMF, LDA and LSA, for analysing news category data from a human-interpretation perspective. Although BERTopic was able to deduce 1145 topics from the data, it could not discard unwanted words such as “to”, “say” and “for”, which do not add any value to the topic semantics. In summary, TF-IDF proved to be the best feature extraction technique and Top2Vec the best topic modelling technique.
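As a rough illustration of the TF-IDF feature extraction named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the function name `tf_idf`, the toy headlines, the whitespace tokenisation and the specific weighting variant (term frequency normalised by document length, multiplied by log(N / document frequency)) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Other TF-IDF variants (smoothing, normalisation) exist.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy headlines, not taken from the HuffPost dataset.
headlines = [
    "senate votes to pass budget",
    "team wins to reach final",
    "budget cuts to hit schools",
]
weights = tf_idf(headlines)
```

Under this variant, a word that occurs in every document, such as "to" in the toy headlines, receives a weight of zero, while a word confined to one headline, such as "senate", is weighted more heavily than one shared by two headlines, such as "budget".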

Acknowledgements

The authors would like to thank UNITAR International University for supporting the publication of this paper.

Author information

Authors and Affiliations

Liverpool John Moores University, Liverpool, UK
Swapna D. Rajan & Manoj Jayabalan
British American Tobacco, Southampton, UK
Thomas Coombs
UNITAR International University, Petaling Jaya, Selangor, Malaysia
Noor Azma Ismail

Corresponding author

Correspondence to Thomas Coombs.

Editor information

Editors and Affiliations

UNITAR Graduate School, UNITAR International University, Petaling Jaya, Malaysia
Yap Bee Wah
Faculty of Engineering and Technology, Liverpool John Moores University, Liverpool, UK
Dhiya Al-Jumeily OBE
University of Tennessee, Knoxville, TN, USA
Michael W. Berry

About this paper

Cite this paper

Rajan, S.D., Coombs, T., Jayabalan, M., Ismail, N.A. (2024). A Comparative Study of Methods for Topic Modelling in News Articles. In: Bee Wah, Y., Al-Jumeily OBE, D., Berry, M.W. (eds) Data Science and Emerging Technologies. DaSET 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 191. Springer, Singapore. https://doi.org/10.1007/978-981-97-0293-0_20

DOI: https://doi.org/10.1007/978-981-97-0293-0_20
Published: 27 April 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0292-3
Online ISBN: 978-981-97-0293-0
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
