DOI: 10.1145/3232116.3232152
research-article

An Improved TF-IDF algorithm based on word frequency distribution information and category distribution information

Published: 19 May 2018

ABSTRACT

The traditional TF-IDF (Term Frequency-Inverse Document Frequency) feature weighting algorithm uses only word frequency information to measure the importance of feature items in a data set, so it cannot correctly reflect the differences between documents of different categories. This paper proposes an improved feature weighting algorithm, FDCD-TF-IDF, based on word frequency distribution information and category distribution information. The improved algorithm introduces the concepts of word frequency distribution and category distribution to describe the weight of a feature item more accurately. The word frequency distribution captures the correlation between feature items and categories, while the category distribution better reflects the category information of feature items. As a result, the improved algorithm reflects the differences between text categories more accurately. Experimental results show that it achieves better classification results on both balanced and unbalanced text data sets.
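
The abstract does not state the FDCD-TF-IDF formulas, so the following is only a minimal sketch of the baseline it criticizes: the common textbook form of TF-IDF, with relative term frequency and an unsmoothed logarithmic IDF. The function name `tf_idf` and the toy documents are assumptions made for illustration and are not taken from the paper. The point to notice is that category labels never enter the computation.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Baseline TF-IDF only (an illustrative assumption, not the paper's
    FDCD-TF-IDF): weight = (count / doc length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus; the category comments are NOT used by tf_idf.
docs = [
    ["goal", "match", "team"],    # sports
    ["team", "goal", "league"],   # sports
    ["market", "stock", "team"],  # finance
]
weights = tf_idf(docs)
```

In this toy corpus, "team" occurs in every document and receives weight zero, which is reasonable. "goal", however, is down-weighted relative to "market" simply because it occurs in two documents, even though both belong to the same category and the term is a strong indicator of that category. This is the kind of category information that the baseline ignores.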

      Published in

      ICIIP '18: Proceedings of the 3rd International Conference on Intelligent Information Processing
      May 2018
      249 pages
      ISBN: 9781450364966
      DOI: 10.1145/3232116

      Copyright © 2018 ACM

      Publisher: Association for Computing Machinery, New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 87 of 367 submissions, 24%
