Abstract
Feature weighting is an important phase of text categorization: it computes a weight for each feature of a document. This paper proposes three new feature weighting methods for text categorization. In the first two methods, the traditional feature weighting method tf×idf is combined in a moderate manner with “one-sided” feature selection metrics (odds ratio and correlation coefficient), and positive and negative features are weighted separately; the resulting weights are denoted tf×idf+OR and tf×idf+CC. In the third method, tf is combined with feature entropy, which is effective and concise. Feature entropy measures the diversity of a feature’s document frequency across categories. Experimental results on the Reuters-21578 corpus show that the proposed methods outperform several state-of-the-art feature weighting methods, such as tf×idf, tf×CHI, and tf×OR.
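To make the third idea concrete, the following is a minimal sketch of an entropy-based weight, not the paper's actual formula, which the abstract does not state. It assumes that feature entropy is the Shannon entropy of a feature's document-frequency distribution over categories, and that tf is scaled by one minus the normalized entropy, so that features concentrated in few categories receive larger weights. The function names and the form of the combination are hypothetical; a plain tf×idf baseline is included for comparison.

```python
import math


def tf_idf_weight(tf, n_docs, df):
    """Classic tf x idf baseline: term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)


def feature_entropy(df_per_category):
    """Shannon entropy (in bits) of a feature's document frequency across categories.

    Assumption: the per-category document frequencies are normalized into a
    probability distribution before the entropy is computed.
    """
    total = sum(df_per_category)
    if total == 0:
        return 0.0
    probs = [df / total for df in df_per_category if df > 0]
    return -sum(p * math.log2(p) for p in probs)


def tf_entropy_weight(tf, df_per_category):
    """Hypothetical combination: tf scaled by (1 - normalized entropy).

    A feature spread evenly over all categories gets weight 0; a feature
    that occurs in a single category keeps its full tf.
    """
    n_categories = len(df_per_category)
    if n_categories < 2:
        return float(tf)
    h_max = math.log2(n_categories)  # entropy of the uniform distribution
    return tf * (1.0 - feature_entropy(df_per_category) / h_max)


if __name__ == "__main__":
    # Feature concentrated in one of four categories vs. spread evenly.
    print(tf_entropy_weight(3, [10, 0, 0, 0]))
    print(tf_entropy_weight(3, [5, 5, 5, 5]))
    print(tf_idf_weight(2, 100, 10))
```

Under these assumptions, a low-entropy feature is treated as more discriminative than a high-entropy one with the same term frequency; other ways of combining tf and entropy are equally plausible readings of the abstract.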
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xue, W., Xu, X. (2010). Three New Feature Weighting Methods for Text Categorization. In: Wang, F.L., Gong, Z., Luo, X., Lei, J. (eds) Web Information Systems and Mining. WISM 2010. Lecture Notes in Computer Science, vol 6318. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16515-3_44
DOI: https://doi.org/10.1007/978-3-642-16515-3_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16514-6
Online ISBN: 978-3-642-16515-3
eBook Packages: Computer Science, Computer Science (R0)