Incorporating Word Embedding and Hybrid Model Random Forest Softmax Regression for Predicting News Categories

Khosa, Saima; Rustam, Furqan; Mehmood, Arif; Choi, Gyu Sang; Ashraf, Imran

doi:10.1007/s11042-023-16491-7

Published: 15 September 2023

Volume 83, pages 31279–31295, (2024)

Multimedia Tools and Applications

Saima Khosa¹,
Furqan Rustam²,
Arif Mehmood³,
Gyu Sang Choi⁴ &
…
Imran Ashraf⁴ (ORCID: orcid.org/0000-0002-8271-6496)

Abstract

Online media has reshaped the news industry, bringing information richness, timely dissemination, and immense diversity. Recent technological advances also enable on-the-spot, prompt, and frequent reporting that can be viewed on smartphones, personal computers, and other mobile devices. These developments have increased the importance of news categorization: accurate categorization raises user satisfaction by delivering news from the categories a user is interested in. Existing approaches to news categorization lack the desired accuracy, and further research is needed to improve their performance. To this end, this research proposes a hybrid model that combines random forest (RF) and SoftMax regression (SMR). To further increase accuracy, special emphasis is placed on preprocessing steps that remove noise from the textual data. Moreover, term frequency-inverse document frequency (TF-IDF) and bag of words (BoW) features are used with the proposed model because of their reported efficacy for this task. Experimental results indicate that the proposed model achieves 98.1% accuracy and outperforms individual machine learning classifiers in accuracy, precision, recall, and F1 score. The hybrid of RF and SMR tends to show better results than both individual classifiers and state-of-the-art approaches.
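The feature representations and the idea of a hybrid classifier named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the RF and SMR outputs are combined, so averaging the two class-probability vectors is an assumption here, as are the toy documents, the two category labels, the hard-coded RF probabilities and SMR scores, and the plain log(N/df) form of the inverse document frequency.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for lowercased, whitespace-split docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(count_vectors):
    """Weight raw counts by idf = log(N / df), one common textbook variant."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    return [
        [vec[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for vec in count_vectors
    ]


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable form)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(prob_a, prob_b):
    """Average two probability vectors; return (winning index, averaged vector)."""
    avg = [(a + b) / 2 for a, b in zip(prob_a, prob_b)]
    return max(range(len(avg)), key=avg.__getitem__), avg


# Toy corpus (hypothetical; not taken from the BBC dataset).
docs = [
    "stocks fall as markets slide",
    "team wins the cup final",
    "markets rally as stocks rise",
]
vocab, bow = bag_of_words(docs)
weights = tf_idf(bow)

# Hypothetical outputs of two already-trained models for a single article.
labels = ["business", "sport"]
rf_probs = [0.6, 0.4]              # e.g. fraction of trees voting for each class
smr_probs = softmax([0.2, 1.1])    # softmax over assumed linear class scores
index, combined = soft_vote(rf_probs, smr_probs)
label = labels[index]
print(label, combined)
```

In this sketch a term that occurs in every document would receive an IDF of zero, which is why practical implementations often use a smoothed variant of the formula.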

Availability of Data and Material

The dataset used in this study is available at http://mlg.ucd.ie/datasets/bbc.html.

Funding

This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2019R1A2C1006159) and (NRF-2021R1A6A1A03039493).

Author information

Authors and Affiliations

Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, 64200, Pakistan
Saima Khosa
School of Computer Science, University College Dublin, D04 V1W8, Dublin, Ireland
Furqan Rustam
Department of CS and IT, The Islamia University of Bahawalpur, Punjab, 63100, Pakistan
Arif Mehmood
Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, 38541, South Korea
Gyu Sang Choi & Imran Ashraf

Corresponding author

Correspondence to Imran Ashraf.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Khosa, S., Rustam, F., Mehmood, A. et al. Incorporating Word Embedding and Hybrid Model Random Forest Softmax Regression for Predicting News Categories. Multimed Tools Appl 83, 31279–31295 (2024). https://doi.org/10.1007/s11042-023-16491-7

Received: 17 November 2021
Revised: 16 May 2023
Accepted: 08 August 2023
Published: 15 September 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-16491-7
