DOI: 10.1145/3484824.3484912
research-article

Applying GA-SVM for Optimizing Statistical and Semantic Features in Document Classification

Published: 13 January 2022

ABSTRACT

The objective of this research is to develop a hybrid model that optimizes the performance of text classification techniques. The authors applied a Genetic Algorithm and a multi-class Support Vector Machine to two publicly available datasets, the 20 Newsgroups corpus and the Reuters-21578 corpus, and to their handcrafted 'Creative corpus', prepared by collecting news articles from the Times of India news portal. They evaluated the model on large as well as small corpora. The Genetic Algorithm dynamically decides the weights of the contextual features so as to maximize classification accuracy. The model achieves its highest accuracy, 100%, on small datasets of Reuters-21578 and the Creative corpus. The authors also present a comparative analysis of the statistical and context-based approaches to text classification. Based on the experimental results, they conclude that statistical approaches are better for classifying small documents, whereas context-based approaches are more efficient for classifying large, text-rich documents. This shows the importance of the hybrid approach, which taps the power of ontological databases and can adapt to varying corpora. It can thus make effective use of the textual data available in reports for applications such as crime detection, crime classification, and disease diagnosis.
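The weight search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and the GA settings (population size, mutation rate, one-point crossover) are all assumptions made for illustration, and a nearest-centroid classifier stands in for the multi-class SVM so that the example needs only the Python standard library. In a real setup the fitness function would be the validation accuracy of the SVM on the weighted features.

```python
import random

def weighted(vec, weights):
    """Scale each feature value by its weight."""
    return [w * x for w, x in zip(weights, vec)]

def centroid_accuracy(weights, train, valid):
    """Accuracy of a nearest-centroid classifier on weighted features.

    Assumption: this is a dependency-free stand-in for SVM accuracy.
    """
    sums, counts = {}, {}
    for vec, label in train:
        wv = weighted(vec, weights)
        acc = sums.setdefault(label, [0.0] * len(wv))
        for i, v in enumerate(wv):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    correct = 0
    for vec, label in valid:
        wv = weighted(vec, weights)
        pred = min(centroids, key=lambda lab: sq_dist(wv, centroids[lab]))
        correct += pred == label
    return correct / len(valid)

def evolve_weights(fitness, n_weights, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Search for feature weights in [0, 1] that maximize `fitness`."""
    rng = random.Random(seed)
    # Include the unweighted baseline so that, with elitism, the result
    # is never worse than it on the fitness data.
    pop = [[1.0] * n_weights]
    pop += [[rng.random() for _ in range(n_weights)]
            for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        elite = ranked[: pop_size // 2]          # selection with elitism
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_weights) if n_weights > 1 else 0
            child = a[:cut] + b[cut:]            # one-point crossover
            child = [
                min(1.0, max(0.0, w + rng.gauss(0, 0.1)))
                if rng.random() < mutation_rate else w
                for w in child
            ]                                    # clamped Gaussian mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

def make_toy_data(n, seed):
    """Hypothetical two-class data: feature 0 is informative, feature 1 is noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice(["sport", "finance"])
        signal = (1.0 if label == "sport" else -1.0) + rng.gauss(0, 0.3)
        noise = rng.gauss(0, 5.0)
        data.append(([signal, noise], label))
    return data
```

A possible use: build `train = make_toy_data(60, seed=1)` and `valid = make_toy_data(60, seed=2)`, define `fitness = lambda w: centroid_accuracy(w, train, valid)`, and call `evolve_weights(fitness, n_weights=2)`. Because the fitness is measured on the validation split itself, a further held-out test set would be needed to report an unbiased accuracy.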

      • Published in

        DSMLAI '21: Proceedings of the International Conference on Data Science, Machine Learning and Artificial Intelligence
        August 2021
        415 pages
        ISBN: 9781450387637
        DOI: 10.1145/3484824

        Copyright © 2021 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited