A Comparative Study of Methods for Topic Modelling in News Articles

  • Conference paper
Data Science and Emerging Technologies (DaSET 2023)

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 191)

Abstract

The past few decades have seen growth in both the volume of textual data and the influence of the news media. As more text from news media becomes available, it is increasingly important to categorise news topics quickly. The primary aim of this research is to suggest a method for automatically identifying news topics in articles. The dataset used was the news category dataset published on Kaggle, comprising 210,294 headlines and abstracts from HuffPost between 2012 and 2022, with 42 categories and six columns. Traditional modelling techniques did not perform well in comparison with Top2Vec, NMF or BERTopic. This research confirms the efficacy of Top2Vec and BERTopic, followed by NMF, LDA and LSA, for analysing news category data from a human-interpretation perspective. Although BERTopic was able to deduce 1145 topics from the data, it could not discard uninformative words such as “to”, “say” and “for”, which add no value to the topic semantics. In summary, TF-IDF proved to be the best feature extraction technique and Top2Vec the best topic modelling technique.
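As a rough illustration of why TF-IDF weighting helps with uninformative words such as “to”, the following minimal sketch computes unsmoothed TF-IDF weights over three toy headlines. It is not the authors' implementation: the function name `tfidf`, the whitespace tokenisation, the plain `tf * log(N / df)` formula and the example headlines are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Assumption: naive lowercase + whitespace tokenisation, no stemming.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical headlines, not taken from the HuffPost dataset.
headlines = [
    "senate votes to pass budget",
    "team wins to reach final",
    "study links diet to health",
]
weights = tfidf(headlines)
```

Because “to” occurs in every toy headline, its inverse document frequency is log(3/3) = 0 and its weight vanishes, while a term that occurs in only one headline, such as “senate”, keeps a positive weight. Library implementations often smooth the idf term, so their weights for common words are small rather than exactly zero.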

Acknowledgements

The authors would like to thank UNITAR International University for supporting the publication of this paper.

Author information

Corresponding author

Correspondence to Thomas Coombs.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Rajan, S.D., Coombs, T., Jayabalan, M., Ismail, N.A. (2024). A Comparative Study of Methods for Topic Modelling in News Articles. In: Bee Wah, Y., Al-Jumeily OBE, D., Berry, M.W. (eds) Data Science and Emerging Technologies. DaSET 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 191. Springer, Singapore. https://doi.org/10.1007/978-981-97-0293-0_20
