Comparative Study between Traditional Machine Learning and Deep Learning Approaches for Text Classification

Published: 28 August 2018

Abstract

Any organization that works with documents must eventually classify them in bulk, a tedious task that is also an early step toward information retrieval and data mining. Manually sorting such large document collections into multiple classes takes considerable time and labor, so a system that classifies them with acceptable accuracy would be of great help in document engineering. We built multiple classifiers for document classification and compared their accuracy on raw and processed data. For the comparison we gathered data used in a corporate organization as well as publicly available data. The data is processed by removing stop-words and applying stemming to reduce words to their roots. Several traditional machine learning techniques are used to classify the documents: Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest classifier, and Multi-Layer Perceptron. The classifiers are applied to raw and processed data separately and their accuracy is recorded. In addition, a deep learning technique, the Convolutional Neural Network, is used to classify the data, and its accuracy is compared with that of the traditional machine learning techniques. We are also exploring hierarchical classifiers for classifying classes and subclasses. The system classifies the data faster and with better accuracy than manual classification. The results are discussed in the results and evaluation section.
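The raw-versus-processed comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the suffix-stripping `crude_stem` function (a stand-in for a real stemmer), the four training sentences, and the hand-written multinomial Naive Bayes class are all assumptions made for illustration, using only the standard library.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "for", "to", "and", "in", "on"}

def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text, process=True):
    """Lowercase and split on whitespace; optionally drop stop-words and stem."""
    tokens = text.lower().split()
    if not process:
        return tokens
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy corpus, invented for this sketch.
train = [
    ("invoice payment due for the shipped orders", "finance"),
    ("payment received for invoice", "finance"),
    ("meeting scheduled to review the hiring plan", "hr"),
    ("hiring interviews and onboarding meeting", "hr"),
]

def train_model(process):
    docs = [preprocess(text, process) for text, _ in train]
    labels = [label for _, label in train]
    return NaiveBayes().fit(docs, labels)

raw_model = train_model(process=False)
processed_model = train_model(process=True)

query = "invoices for pending payments"
print("raw tokens:      ", preprocess(query, process=False))
print("processed tokens:", preprocess(query, process=True))
print("processed model: ", processed_model.predict(preprocess(query, process=True)))
```

In this toy setting, stemming maps "invoices" and "payments" onto the training tokens "invoice" and "payment", so the processed model has content words to match, whereas the raw model shares only the stop-word "for" with the query. A real comparison would use a proper stemmer, a held-out test set, and the other classifiers named in the abstract.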

Published In

DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018
August 2018
311 pages
ISBN: 9781450357692
DOI: 10.1145/3209280
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

In-Cooperation

  • SIGDOC: ACM Special Interest Group on Systems Documentation

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Business Document Analysis
  2. Deep Learning
  3. Document Classification
  4. Machine Learning
  5. Text Mining

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

DocEng '18: ACM Symposium on Document Engineering 2018
August 28 - 31, 2018
Halifax, NS, Canada

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 445
  • Downloads (Last 6 weeks): 35
Reflects downloads up to 28 Feb 2025

Cited By

  • (2025) Deep Learning vs. Machine Learning for Intrusion Detection in Computer Networks: A Comparative Study. Applied Sciences 15:4 (1903). DOI: 10.3390/app15041903. Online publication date: 12-Feb-2025
  • (2025) TepiSense: A Social Computing-Based Real-Time Epidemic Surveillance System Using Artificial Intelligence. IEEE Access 13 (23816-23832). DOI: 10.1109/ACCESS.2025.3537168. Online publication date: 2025
  • (2025) Comprehensive Analysis of Word Embedding Models and Design of Effective Feature Vector for Classification of Amazon Product Reviews. IEEE Access 13 (25239-25255). DOI: 10.1109/ACCESS.2025.3536631. Online publication date: 2025
  • (2025) Strategy-Switch: From All-Reduce to Parameter Server for Faster Efficient Training. IEEE Access 13 (9510-9523). DOI: 10.1109/ACCESS.2025.3528248. Online publication date: 2025
  • (2025) Artificial intelligence for the discovery of antimicrobial peptides. Antimicrobial Peptides (59-79). DOI: 10.1016/B978-0-443-15393-8.00003-8. Online publication date: 2025
  • (2025) Decision support system to reveal future career over students’ survey using explainable AI. Education and Information Technologies. DOI: 10.1007/s10639-025-13361-7. Online publication date: 30-Jan-2025
  • (2024) Revolutionizing Duplicate Question Detection: A Deep Learning Approach for Stack Overflow. IgMin Research 2:1 (001-005). DOI: 10.61927/igmin135. Online publication date: 9-Jan-2024
  • (2024) Marine Equipment Siting Using Machine-Learning-Based Ocean Remote Sensing Data: Current Status and Future Prospects. Sustainability 16:20 (8889). DOI: 10.3390/su16208889. Online publication date: 14-Oct-2024
  • (2024) Comprehensive Review of Multiclass Text Classification using the 20 Newsgroup Dataset. International Journal of Scientific Research in Computer Science, Engineering and Information Technology 10:6 (1193-1212). DOI: 10.32628/CSEIT241061166. Online publication date: 30-Nov-2024
  • (2024) Sentiment Analysis on a Large Indonesian Product Review Dataset. Journal of Information Systems Engineering and Business Intelligence 10:1 (167-178). DOI: 10.20473/jisebi.10.1.167-178. Online publication date: 28-Feb-2024
