Abstract
Term weighting, which converts documents into vectors in the term space, is a vital step in automatic text categorization. Previous studies have shown that the choice of term weighting scheme has a dominant effect on classification performance. Term weighting has been studied extensively for English text classification, but few studies have addressed Vietnamese text classification. In this paper, we propose a term weighting scheme called normalize(tf.rf_max), which is based on tf.rf, one of the most effective term weighting schemes to date. We conducted experiments comparing the proposed normalize(tf.rf_max) scheme with tf.rf and tf.idf on a Vietnamese text classification benchmark. The results show that the proposed scheme achieves about 3%–5% higher accuracy than the other term weighting schemes.
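For readers unfamiliar with the two baselines named above, the following is a minimal sketch, not the authors' implementation. It assumes the definitions commonly used in the literature: tf.idf as tf × log(N/df), and tf.rf with rf = log2(2 + a / max(1, c)), where a and c count the documents containing the term inside and outside the target category. The abstract does not define normalize(tf.rf_max), so it is not sketched here. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by raw term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def rf(term, docs, labels, category):
    """Relevance frequency (assumed form): log2(2 + a / max(1, c))."""
    # a: documents of the target category that contain the term
    a = sum(1 for d, y in zip(docs, labels) if y == category and term in d)
    # c: documents of all other categories that contain the term
    c = sum(1 for d, y in zip(docs, labels) if y != category and term in d)
    return math.log2(2 + a / max(1, c))


def tf_rf(doc, docs, labels, category):
    """Weight each term of one document by raw term frequency times rf."""
    return {t: tf * rf(t, docs, labels, category) for t, tf in Counter(doc).items()}


# Hypothetical toy corpus: word-segmented tokens with diacritics dropped.
docs = [
    ["bong_da", "doi", "bong_da"],
    ["doi", "co_phieu"],
    ["co_phieu", "thi_truong"],
]
labels = ["sport", "business", "business"]

idf_weights = tf_idf(docs)
rf_weights = tf_rf(docs[0], docs, labels, "sport")
print(idf_weights[0])
print(rf_weights)
```

Unlike tf.idf, the rf factor uses the category labels, so a term concentrated in the target category receives a larger weight than a term that appears only in other categories.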
Acknowledgment
This research is funded by Vietnam National University, Ho Chi Minh City (VNU-HCM) under grant number C2014-26-04.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Nguyen, V.T., Hai, N.T., Nghia, N.H., Le, T.D. (2015). A Term Weighting Scheme Approach for Vietnamese Text Classification. In: Dang, T., Wagner, R., Küng, J., Thoai, N., Takizawa, M., Neuhold, E. (eds) Future Data and Security Engineering. FDSE 2015. Lecture Notes in Computer Science, vol 9446. Springer, Cham. https://doi.org/10.1007/978-3-319-26135-5_4
DOI: https://doi.org/10.1007/978-3-319-26135-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26134-8
Online ISBN: 978-3-319-26135-5
eBook Packages: Computer Science, Computer Science (R0)