
An Improved TF-IDF algorithm based on word frequency distribution information and category distribution information

Authors:

Haoying Wu,

Na Yuan

ICIIP '18: Proceedings of the 3rd International Conference on Intelligent Information Processing

Pages 211 - 215

https://doi.org/10.1145/3232116.3232152

Published: 19 May 2018

Abstract

The traditional TF-IDF (Term Frequency-Inverse Document Frequency) feature weighting algorithm uses only word frequency information to measure the importance of feature items in a data set. As a result, it cannot correctly reflect the differences between documents of different categories. This paper proposes an improved feature weighting algorithm, FDCD-TF-IDF, based on word frequency distribution information and category distribution information. The improved algorithm introduces the concepts of word frequency distribution and category distribution to describe the weight of a feature item more accurately. The word frequency distribution mainly captures the correlation between feature items and categories, while the category distribution better reflects the category information of feature items. The improved algorithm can therefore accurately reflect the differences between text categories. Experimental results show that it achieves better classification results on both balanced and unbalanced text data sets.
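The limitation described in the abstract can be illustrated with a minimal sketch of classic TF-IDF. This is not the authors' FDCD-TF-IDF, whose formulas are not given on this page. It uses the common textbook form tf(t, d) · log(N / df(t)), and the function name `tf_idf`, the toy corpus, and the category labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF weights: tf(t, d) * log(N / df(t)).

    `docs` is a list of token lists. This is the textbook baseline,
    not the FDCD-TF-IDF variant proposed in the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus; category labels are shown only in comments
# because classic TF-IDF never looks at them.
docs = [
    ["goal", "match", "report"],    # sports
    ["goal", "score", "win"],       # sports
    ["stock", "market", "report"],  # finance
    ["stock", "bank", "loan"],      # finance
]

weights = tf_idf(docs)
# "goal" occurs only in sports documents, while "report" is split across
# both categories, yet both have df = 2 and so get the same weight.
print(weights[0]["goal"], weights[0]["report"])
```

In this toy corpus, "goal" is a far better indicator of the sports category than "report", but the baseline scores them identically in the first document. That is the kind of gap that category-aware weighting aims to close.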

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection

Information & Contributors

Information

Published In

ICIIP '18: Proceedings of the 3rd International Conference on Intelligent Information Processing

May 2018

249 pages

ISBN:9781450364966

DOI:10.1145/3232116

In-Cooperation

Guilin: Guilin University of Technology, Guilin, China
International Engineering and Technology Institute, Hong Kong: International Engineering and Technology Institute, Hong Kong

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICIIP '18

ICIIP '18: 2018 3rd International Conference on Intelligent Information Processing

May 19 - 20, 2018

Guilin, China

Acceptance Rates

Overall Acceptance Rate 87 of 367 submissions, 24%
