ABSTRACT
This research develops a hybrid model for optimizing the performance of text classification techniques. The authors applied a Genetic Algorithm (GA) together with a multi-class Support Vector Machine (SVM) to two publicly available datasets, the 20 Newsgroups corpus and the Reuters-21578 corpus, as well as to their handcrafted 'Creative corpus', compiled from news articles on the Times of India news portal. They evaluated the model on both large and small corpora. The Genetic Algorithm dynamically determines the weights of the contextual features so as to maximize classification accuracy. The model reaches its highest accuracy, 100%, on small datasets drawn from Reuters-21578 and the Creative corpus. The authors also present a comparative analysis of statistical and context-based approaches to text classification. Based on the experimental results, they conclude that statistical approaches work better for small documents, whereas context-based approaches are more effective for large, text-rich documents. This finding underlines the value of the hybrid approach, which taps the power of ontological databases and adapts to varying corpora. It can thus make effective use of the textual data available in reports for tasks such as crime detection, crime classification, and disease diagnosis.
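The idea of letting a Genetic Algorithm choose feature weights that maximize classification accuracy can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, population size, mutation settings, and toy data are all assumptions made for illustration, and a simple nearest-centroid classifier stands in for the multi-class SVM so that the example needs only the Python standard library.

```python
import random


def centroid_accuracy(weights, X, y):
    """Stand-in fitness: training accuracy of a nearest-centroid classifier
    on weighted features. (The abstract describes a multi-class SVM in this
    role; the substitute keeps the sketch dependency-free.)"""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]

    correct = 0
    for x, label in zip(X, y):
        # Weighted squared Euclidean distance to each class centroid.
        def dist(c):
            return sum(w * (a - b) ** 2
                       for w, a, b in zip(weights, x, centroids[c]))
        correct += min(classes, key=dist) == label
    return correct / len(y)


def evolve_weights(X, y, fitness=centroid_accuracy, pop_size=20,
                   generations=30, mutation_rate=0.2, seed=0):
    """Hypothetical GA loop: each chromosome is a vector of feature
    weights in [0, 1]; fitness is the classifier's accuracy."""
    rng = random.Random(seed)
    n = len(X[0])
    population = [[rng.random() for _ in range(n)] for _ in range(pop_size)]

    for _ in range(generations):
        # Selection: keep the fitter half unchanged (elitism).
        ranked = sorted(population, key=lambda w: fitness(w, X, y),
                        reverse=True)
        elite = ranked[: pop_size // 2]

        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            # Single-point crossover.
            cut = rng.randrange(1, n) if n > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Gaussian mutation, clipped back into [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = elite + children

    best = max(population, key=lambda w: fitness(w, X, y))
    return best, fitness(best, X, y)


if __name__ == "__main__":
    # Toy data (assumed): feature 0 separates the classes, feature 1 is noise.
    X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
         [10.0, 0.5], [9.9, 0.2], [10.1, 0.8]]
    y = ["a", "a", "a", "b", "b", "b"]
    weights, accuracy = evolve_weights(X, y)
    print(weights, accuracy)
```

Because the fitness function is passed in as a parameter, a cross-validated SVM accuracy could be substituted for `centroid_accuracy` without changing the evolutionary loop.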
Index Terms
- Applying GA-SVM for Optimizing Statistical and Semantic Features in Document Classification