An Improved Information Gain Algorithm Based on Relative Document Frequency Distribution

Peng, Jian; Yang, Xiao-Hua; Ouyang, Chun-Ping; Liu, Yong-Bin

doi:10.1007/978-3-319-50496-4_49

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10102)

Abstract

Feature selection plays an important role in text categorization. To address drawbacks of the traditional information gain (IG) approach and of recently proposed improvements to it, this paper proposes an improved IG feature selection method based on relative document frequency distribution. The method reduces the impact of unbalanced data sets and low-frequency features, and takes into account both the frequency distribution of features within a category and the relative document frequency distribution of features among different categories. Experimental results on the NLPCC-ICCPOL 2016 stance detection task for Chinese microblogs show that the improved method outperforms the traditional IG approach and another improved method in feature selection.
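For orientation, the traditional IG baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the standard document-frequency form of IG, IG(t) = H(C) - P(t) H(C|t) - P(¬t) H(C|¬t), and not the improved method proposed in the paper, whose formula is not given on this page. The function names, the toy documents, and the stance labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard IG of a term: H(C) minus the entropy of C conditioned on
    the presence or absence of the term in a document.

    docs   -- list of token sets, one per document (illustrative format)
    labels -- list of category labels, aligned with docs
    """
    n = len(docs)
    with_t = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_t = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)

    h_c = entropy(list(Counter(labels).values()))
    conditional = 0.0
    for part in (with_t, without_t):
        size = sum(part.values())
        if size:  # skip an empty partition to avoid dividing by zero
            conditional += (size / n) * entropy(list(part.values()))
    return h_c - conditional


# Toy corpus (hypothetical): two documents per stance label.
docs = [
    {"support", "policy"},
    {"support", "good"},
    {"against", "policy"},
    {"against", "bad"},
]
labels = ["FAVOR", "FAVOR", "AGAINST", "AGAINST"]

# Rank the vocabulary by IG, highest first.
vocab = sorted(set().union(*docs))
ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy corpus, "support" separates the two labels perfectly and scores 1 bit, while "policy" appears equally in both labels and scores 0. Because this baseline uses only the presence or absence of a term in a document, it ignores how often the term occurs within a category and how unevenly documents are spread across categories, which are the factors the abstract says the improved method accounts for.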

Acknowledgements

This research is supported by the National Natural Science Foundation of China (No. 61402220, No. 61502221), the Scientific Research Fund of Hunan Provincial Education Department (No. 14B153, No. 16C1378), and the Philosophy and Social Science Foundation of Hunan Province (No. 14YBA335).

Author information

Authors and Affiliations

School of Computer Science and Technology, University of South China, Hengyang, 421001, China
Jian Peng, Xiao-Hua Yang, Chun-Ping Ouyang & Yong-Bin Liu

Corresponding author

Correspondence to Xiao-Hua Yang.

Editor information

Editors and Affiliations

Microsoft Research Asia, Beijing, China
Chin-Yew Lin
Brandeis University, Waltham, Massachusetts, USA
Nianwen Xue
Peking University, Beijing, China
Dongyan Zhao
Fudan University, Shanghai, China
Xuanjing Huang
Peking University, Beijing, China
Yansong Feng

About this paper

Cite this paper

Peng, J., Yang, XH., Ouyang, CP., Liu, YB. (2016). An Improved Information Gain Algorithm Based on Relative Document Frequency Distribution. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. NLPCC-ICCPOL 2016. Lecture Notes in Computer Science (LNAI), vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_49

DOI: https://doi.org/10.1007/978-3-319-50496-4_49
Published: 02 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50495-7
Online ISBN: 978-3-319-50496-4
eBook Packages: Computer Science, Computer Science (R0)
