Introduction to Text Classification: Impact of Stemming and Comparing TF-IDF and Count Vectorization as Feature Extraction Technique

Wendland, André; Zenere, Marco; Niemann, Jörg

doi:10.1007/978-3-030-85521-5_19

Conference paper
First Online: 25 August 2021

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1442)

Abstract

Natural language processing is widely used in research and industry. Use cases include, among others, sentiment analysis, speech recognition, classification, query answering and machine translation. In this research we investigate widely applied preprocessing methods for improving the results of different algorithms trained on a fake news data set. As feature extraction methods we compared TF-IDF and count vectorization. TF-IDF yielded slightly better results in terms of accuracy. We found that, contrary to current research, stemming leads to a minor increase in false positive and false negative classifications, and hence to a decrease in accuracy. Among the compared models, logistic regression and support vector machine yielded the best results.
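To make the difference between the two feature extraction methods concrete, the following is a minimal sketch in plain Python, not the authors' pipeline. It assumes whitespace tokenization and the textbook weighting tf × ln(N/df); libraries such as scikit-learn use smoothed and normalized variants, so their numbers differ. The three example sentences and the function names are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real pipelines do more.
    return text.lower().split()


def count_vectors(docs):
    """Return (vocabulary, rows), where each row holds raw term counts."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def tfidf_vectors(docs):
    """Return (vocabulary, rows) weighted by tf * ln(N / df), unsmoothed."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    rows = [[c * w for c, w in zip(row, idf)] for row in counts]
    return vocab, rows


# Hypothetical toy corpus, not taken from the data set used in the paper.
docs = [
    "the senate passed the bill",
    "shocking secret the media hides",
    "the bill was signed",
]

vocab, counts = count_vectors(docs)
_, tfidf = tfidf_vectors(docs)

for term in ("the", "bill", "shocking"):
    j = vocab.index(term)
    print(term, [row[j] for row in counts], [round(row[j], 3) for row in tfidf])
```

In this sketch the word "the" has the highest raw count but a TF-IDF weight of zero, because it occurs in every document, while terms that occur in fewer documents keep a positive weight. Count vectorization passes the raw counts to the classifier unchanged.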

Notes

1.
The dataset is not completely balanced and contains more true news than fake news.

Author information

Authors and Affiliations

Faculty of Computer Science, Free University of Bozen Bolzano, Piazza Università, 1, 39100, Bolzano, BZ, Italy
André Wendland & Marco Zenere
Department of Mechanical and Process Engineering, University of Applied Sciences Düsseldorf, Münsterstraße 156, 40476, Düsseldorf, Germany
Jörg Niemann

Editor information

Editors and Affiliations

Gazi University, Ankara, Turkey
Murat Yilmaz
Dublin City University, Dublin, Ireland
Paul Clarke
I.S.C.N. GesmbH, Graz, Austria
Richard Messnarz
IMC University of Applied Sciences Krems, Krems, Austria
Michael Reiner

About this paper

Cite this paper

Wendland, A., Zenere, M., Niemann, J. (2021). Introduction to Text Classification: Impact of Stemming and Comparing TF-IDF and Count Vectorization as Feature Extraction Technique. In: Yilmaz, M., Clarke, P., Messnarz, R., Reiner, M. (eds) Systems, Software and Services Process Improvement. EuroSPI 2021. Communications in Computer and Information Science, vol 1442. Springer, Cham. https://doi.org/10.1007/978-3-030-85521-5_19

DOI: https://doi.org/10.1007/978-3-030-85521-5_19
Published: 25 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85520-8
Online ISBN: 978-3-030-85521-5
eBook Packages: Computer Science, Computer Science (R0)
