ABSTRACT
The traditional TF-IDF (Term Frequency-Inverse Document Frequency) feature weighting algorithm uses only word frequency information to measure the importance of feature items in a data set. As a result, it cannot correctly reflect the differences between documents of different categories. This paper proposes an improved feature weighting algorithm, FDCD-TF-IDF, based on word frequency distribution information and category distribution information. The improved algorithm introduces the concepts of word frequency distribution and category distribution to describe the weight of a feature item more accurately. The word frequency distribution mainly targets the correlation between feature items and categories, while the category distribution better reflects the category information of feature items. Together, they allow the improved algorithm to reflect the differences between text categories more accurately. Experimental results show that the improved algorithm achieves better classification results on both balanced and unbalanced text data sets.
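To make the limitation of the traditional weighting concrete, the following is a minimal sketch of textbook TF-IDF (term frequency multiplied by log(N / df)). It is not the FDCD-TF-IDF algorithm proposed in the paper, whose formulas are not given in the abstract. The function name `tf_idf`, the toy corpus, and the category labels in the comments are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF: tf(t, d) * log(N / df(t)).

    `docs` is a list of token lists. Category labels are never consulted,
    which is the limitation the abstract points out.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus; the categories exist only in these comments.
docs = [
    ["goal", "report", "team"],     # sports
    ["goal", "coach", "match"],     # sports
    ["stock", "report", "bank"],    # finance
    ["stock", "price", "market"],   # finance
]
weights = tf_idf(docs)

# "goal" occurs only in sports documents, "report" once in each category,
# yet both occur in two documents, so their weights in docs[0] are identical.
same_weight = weights[0]["goal"] == weights[0]["report"]
```

In this sketch `goal` is concentrated in one category and `report` is spread across both, but the weighting cannot tell them apart because it only counts documents. This is the kind of gap that category-aware distribution information is meant to close.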
Index Terms
- An Improved TF-IDF algorithm based on word frequency distribution information and category distribution information