Abstract
Natural language processing is widely used in research and industry. Use cases include sentiment analysis, speech recognition, classification, query answering and machine translation, among others. In this research, we investigate widely applied preprocessing methods for improving the results of different algorithms trained on a fake news data set. As feature extraction methods, we compared TF-IDF and count vectorization. TF-IDF yielded slightly better results in terms of accuracy. We found that, contrary to current research, stemming leads to a minor increase in false positive and false negative classifications and hence to a decrease in accuracy. Among the compared models, logistic regression and support vector machines yielded the best results.
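To make the two feature extraction methods and the effect of stemming concrete, the following is a minimal sketch in plain Python, not the authors' pipeline. The three example headlines, the function names, and the crude `toy_stem` suffix stripper are all hypothetical and chosen only for illustration. The IDF formula is one common smoothed variant, and no vector normalization is applied.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; real pipelines usually also strip punctuation.
    return text.lower().split()


def toy_stem(token):
    # Hypothetical, deliberately crude suffix stripper (NOT the Porter stemmer).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def count_vectors(docs, stem=False):
    # Count vectorization: one raw term count per vocabulary entry and document.
    tokenized = [[toy_stem(t) if stem else t for t in tokenize(d)] for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs, stem=False):
    # TF-IDF: scale each raw count by a smoothed inverse document frequency,
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1 (an assumed, commonly used variant).
    vocab, counts = count_vectors(docs, stem)
    n_docs = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]


# Hypothetical toy corpus, not taken from the data set used in the paper.
docs = [
    "president signs new trade bill",
    "shocking report claims president signed secret bill",
    "senate votes on trade bill",
]

vocab_raw, counts_raw = count_vectors(docs)
vocab_stem, counts_stem = count_vectors(docs, stem=True)
vocab_tfidf, weights = tfidf_vectors(docs)

# Stemming merges "signs" and "signed" into one feature, shrinking the vocabulary.
print(len(vocab_raw), len(vocab_stem))

# In the second headline, "bill" (present in every document) is weighted
# lower than "shocking" (present in only one document).
bill = weights[1][vocab_tfidf.index("bill")]
shocking = weights[1][vocab_tfidf.index("shocking")]
print(round(bill, 3), round(shocking, 3))
```

With count vectorization, both words in the second headline have the same value of 1, whereas TF-IDF down-weights the term that occurs in every document. The stemmed vocabulary is smaller because inflected forms collapse into one feature, which also means any distinction between those forms is no longer available to a classifier.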
Notes
1. The dataset is not completely balanced and contains more true news than fake news.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Wendland, A., Zenere, M., Niemann, J. (2021). Introduction to Text Classification: Impact of Stemming and Comparing TF-IDF and Count Vectorization as Feature Extraction Technique. In: Yilmaz, M., Clarke, P., Messnarz, R., Reiner, M. (eds) Systems, Software and Services Process Improvement. EuroSPI 2021. Communications in Computer and Information Science, vol 1442. Springer, Cham. https://doi.org/10.1007/978-3-030-85521-5_19
DOI: https://doi.org/10.1007/978-3-030-85521-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85520-8
Online ISBN: 978-3-030-85521-5
eBook Packages: Computer Science, Computer Science (R0)