research-article

Comparative Study between Traditional Machine Learning and Deep Learning Approaches for Text Classification

Authors:

Cannannore Nidhi Kamath,

Syed Saqib Bukhari,

Andreas Dengel

DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018

Article No.: 14, Pages 1 - 11

https://doi.org/10.1145/3209280.3209526

Published: 28 August 2018

Abstract

Any organization that works with documents eventually faces the tedious task of classifying large volumes of them, a first step toward information retrieval and data mining. Classifying so many documents into multiple classes by hand demands considerable time and labor, so a system that classifies them with acceptable accuracy would be of great help in document engineering. We built multiple classifiers for document classification and compared their accuracy on raw and processed data. For the comparison, we gathered data used in a corporate organization as well as publicly available data. The data is processed by removing stop-words and applying stemming to reduce words to their roots. Several traditional machine learning techniques, namely Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest classifier, and Multi-Layer Perceptron, are used to classify the documents. Each classifier is applied to the raw and the processed data separately, and its accuracy is recorded. In addition, a deep learning technique, a Convolutional Neural Network, is used to classify the data, and its accuracy is compared with that of the traditional machine learning techniques. We are also exploring hierarchical classifiers for classifying classes and subclasses. The system classifies the data faster and with better accuracy than manual classification. The results are discussed in the results and evaluation section.
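The raw-versus-processed comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the stop-word list, the suffix rules in `crude_stem`, the toy documents, and the class labels are all hypothetical, and only one of the listed classifiers (Naive Bayes, written from scratch here) is shown. A real pipeline would likely use an established stemmer and library classifiers.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "for", "and", "in", "on"}
SUFFIXES = ("ing", "ed", "es", "s")
PUNCT = ".,;:!?\"'()"


def tokenize(text):
    """Raw data: lowercase, split on whitespace, strip basic punctuation."""
    words = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in words if w]


def crude_stem(word):
    """Strip one common suffix; a simplified stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Processed data: remove stop-words, then stem what remains."""
    return [crude_stem(w) for w in tokenize(text) if w not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # skip words never seen in training
                    count = self.word_counts[label][w]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus; the paper's corporate and public data are not reproduced.
TRAIN = [
    ("Invoices for the shipped orders are attached", "finance"),
    ("Payment of the invoice is pending", "finance"),
    ("The server is crashing on startup", "support"),
    ("Servers crashed and restarting failed", "support"),
]
TEST = [
    ("Pending payments and invoice", "finance"),
    ("Restart the crashed server", "support"),
]


def evaluate(prep):
    """Train on TRAIN and return accuracy on TEST for a given text-preparation function."""
    train_docs = [prep(text) for text, _ in TRAIN]
    train_labels = [label for _, label in TRAIN]
    model = NaiveBayes().fit(train_docs, train_labels)
    hits = sum(model.predict(prep(text)) == label for text, label in TEST)
    return hits / len(TEST)


results = {"raw": evaluate(tokenize), "processed": evaluate(preprocess)}
print(results)
```

Passing the text-preparation step to `evaluate` as a function keeps the classifier identical across both runs, so any difference in accuracy comes from the preprocessing alone. On a corpus this small the numbers carry no statistical weight; the sketch only shows the shape of the experiment.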


Published In

DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018

August 2018

311 pages

ISBN: 9781450357692

DOI: 10.1145/3209280

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

In-Cooperation

SIGDOC: ACM Special Interest Group on Systems Documentation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

DocEng '18

Sponsor:

SIGWEB

DocEng '18: ACM Symposium on Document Engineering 2018

August 28 - 31, 2018

Halifax, NS, Canada

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%
