Abstract
This study addresses the computation of compact document vectors that use both terms and termsets for binary text categorization. Termsets are generally concatenated with all terms, which leads to large document vectors. This study instead considers selecting a subset of terms and termsets that represents documents compactly yet effectively. Two methods are studied for this purpose. The first evaluates combinations of terms and termsets in different proportions. The second selects the subset using normalized ranking scores of terms and termsets. Experiments on two widely used datasets show that termsets can effectively complement terms even when only a small number of features is used to represent documents.
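The two selection strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function, the normalization, or the proportions used, so everything below is an assumption made for illustration: the helper names (`top_k`, `select_by_proportion`, `min_max`, `select_by_normalized_score`), the toy scores, the `+` notation for termsets, and the use of min-max scaling as the normalization. It is not the authors' implementation.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring feature names (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def select_by_proportion(term_scores: Dict[str, float],
                         termset_scores: Dict[str, float],
                         n_features: int,
                         termset_fraction: float) -> List[str]:
    """Fill a fixed share of the feature budget with termsets, the rest with terms."""
    n_termsets = round(n_features * termset_fraction)
    n_terms = n_features - n_termsets
    return top_k(term_scores, n_terms) + top_k(termset_scores, n_termsets)


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] (assumed normalization, not from the source)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {f: ((s - lo) / span if span else 0.0) for f, s in scores.items()}


def select_by_normalized_score(term_scores: Dict[str, float],
                               termset_scores: Dict[str, float],
                               n_features: int) -> List[str]:
    """Normalize each group separately, pool them, and keep the best n_features."""
    pooled = {**min_max(term_scores), **min_max(termset_scores)}
    return top_k(pooled, n_features)


# Hypothetical ranking scores on different scales for terms and termsets.
term_scores = {"oil": 9.0, "price": 6.0, "barrel": 3.0, "the": 1.0}
termset_scores = {"oil+price": 0.8, "oil+barrel": 0.5, "price+the": 0.2}

by_proportion = select_by_proportion(term_scores, termset_scores, 4, 0.5)
by_score = select_by_normalized_score(term_scores, termset_scores, 4)
print(by_proportion)
print(by_score)
```

In this sketch the proportion-based variant fixes the mix in advance, whereas the score-based variant lets the normalized scores decide how many termsets enter the selected subset. Normalizing each group separately is what makes scores on different scales comparable before pooling.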
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Badawi, D., Altınçay, H. (2018). Compact Representation of Documents Using Terms and Termsets. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2018. Lecture Notes in Computer Science, vol 10934. Springer, Cham. https://doi.org/10.1007/978-3-319-96136-1_7
DOI: https://doi.org/10.1007/978-3-319-96136-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96135-4
Online ISBN: 978-3-319-96136-1
eBook Packages: Computer Science, Computer Science (R0)