DOI: 10.1145/3232116.3232152
research-article

An Improved TF-IDF algorithm based on word frequency distribution information and category distribution information

Published: 19 May 2018

Abstract

The traditional TF-IDF (Term Frequency-Inverse Document Frequency) feature weighting algorithm uses only word frequency information to measure the importance of feature items in a data set. As a result, it cannot correctly reflect the differences between documents of different categories. This paper proposes an improved feature weighting algorithm, FDCD-TF-IDF, based on word frequency distribution information and category distribution information. The improved algorithm introduces the concepts of word frequency distribution and category distribution to describe the weight of a feature item more accurately. The word frequency distribution mainly captures the correlation between feature items and categories, while the category distribution better reflects the category information of feature items. Together, these allow the improved algorithm to reflect the differences between text categories more accurately. Experimental results show that the improved algorithm achieves better classification results on both balanced and unbalanced text data sets.
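The limitation described in the abstract can be illustrated with a minimal sketch of the classic weighting, assuming the common textbook form tf = count / document length and idf = log(N / df). The toy corpus, the labels, and the function names `tf_idf` and `category_df` are hypothetical and chosen only for illustration; this is the baseline TF-IDF, not the FDCD-TF-IDF formula, which the abstract does not state.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF: tf = count / doc length, idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def category_df(docs, labels, term):
    """Number of documents containing `term`, split by category label."""
    return dict(Counter(
        label for tokens, label in zip(docs, labels) if term in tokens
    ))


# Hypothetical toy corpus: two categories, two documents each.
docs = [
    ["goal", "report", "team"],
    ["goal", "match", "league"],
    ["stock", "report", "bank"],
    ["stock", "market", "trade"],
]
labels = ["sports", "sports", "finance", "finance"]

weights = tf_idf(docs)

# "goal" occurs only in sports documents, "report" is spread over both
# categories, yet both have df = 2 and receive the same weight in docs[0].
print(weights[0]["goal"], weights[0]["report"])
print(category_df(docs, labels, "goal"), category_df(docs, labels, "report"))
```

In this sketch "goal" is concentrated in one category while "report" is spread evenly over both, but the classic weight cannot tell them apart because it never looks at the labels. The per-category counts returned by `category_df` are the kind of information a category-aware weighting could draw on.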

    Information & Contributors

    Information

    Published In

    ICIIP '18: Proceedings of the 3rd International Conference on Intelligent Information Processing
    May 2018
    249 pages
    ISBN: 9781450364966
    DOI: 10.1145/3232116

    In-Cooperation

    • Guilin University of Technology, Guilin, China
    • International Engineering and Technology Institute, Hong Kong

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. TF-IDF
    2. feature selection
    3. feature weighting
    4. text categorization

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICIIP '18

    Acceptance Rates

    Overall Acceptance Rate 87 of 367 submissions, 24%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 28
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 02 Mar 2025

    Citations

    Cited By

    • (2024) An Ensemble-Based Multi-Classification Machine Learning Classifiers Approach to Detect Multiple Classes of Cyberbullying. Machine Learning and Knowledge Extraction 6:1 (156-170). DOI: 10.3390/make6010009. Online publication date: 12-Jan-2024
    • (2024) News dissemination: a semantic approach to barrier classification. Journal of Intelligent Information Systems. DOI: 10.1007/s10844-024-00894-5. Online publication date: 4-Nov-2024
    • (2024) Machine Learning and Deep Learning Techniques in Countering Cyberterrorism. Cyberspace, Cyberterrorism and the International Security in the Fourth Industrial Revolution (135-158). DOI: 10.1007/978-3-031-50454-9_8. Online publication date: 19-Jan-2024
    • (2023) On the class separability of contextual embeddings representations – or “The classifier does not matter when the (text) representation is so good!”. Information Processing and Management: an International Journal 60:4. DOI: 10.1016/j.ipm.2023.103336. Online publication date: 1-Jul-2023
    • (2023) Feature extraction from unstructured texts as a combination of the morphological and the syntactic analysis and its usage in fake news classification tasks. Neural Computing and Applications 35:29 (22055-22067). DOI: 10.1007/s00521-023-08967-2. Online publication date: 7-Sep-2023
    • (2022) Problem formulation in inventive design using Doc2vec and Cosine Similarity as Artificial Intelligence methods and Scientific Papers. Engineering Applications of Artificial Intelligence 109:C. DOI: 10.1016/j.engappai.2022.104661. Online publication date: 1-Mar-2022
    • (2022) Evolution pathways of robotic technologies and applications in construction. Advanced Engineering Informatics 51:C. DOI: 10.1016/j.aei.2022.101529. Online publication date: 1-Jan-2022
    • (2021) Machine Learning for Detecting Data Exfiltration. ACM Computing Surveys 54:3 (1-47). DOI: 10.1145/3442181. Online publication date: 8-May-2021
    • (2021) A Transformers Approach to Detect Depression in Social Media. 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS) (718-723). DOI: 10.1109/ICAIS50930.2021.9395943. Online publication date: 25-Mar-2021
    • (2021) Personalization of News for a Logistics Organisation by Finding Relevancy Using NLP. Modern Approaches in Machine Learning and Cognitive Science: A Walkthrough (215-226). DOI: 10.1007/978-3-030-68291-0_16. Online publication date: 27-Apr-2021
