Abstract
Automated text categorization aims to provide an effective solution to today’s unprecedented growth of textual data. Because it can organize a huge and varied body of texts from which invaluable insights can be gained, it has become an emerging field of investigation for the research community. However, although several mathematical approaches have been studied to formalize the main components of a text categorization system (text representation, feature extraction, and the classification process), such systems still face many difficulties, due both to the complex nature of text databases and to the high dimensionality of text representations. This paper introduces an alternative way to address this problem. First, it reduces the original set of features using a newly proposed metric. Second, it automatically classifies a text without necessarily processing all of its features. Moreover, some standard preprocessing steps, such as stemming, can be omitted with this approach. The experimental results showed that this new text categorization method outperforms state-of-the-art methods: the F-measures obtained on the 20 Newsgroups, BBC News, Reuters, and AG News datasets were 95.06%, 98.21%, 88.44%, and 95.70%, respectively, while standard approaches returned considerably lower scores.
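The idea of classifying a text without reading all of its features can be illustrated with a minimal sketch. It is not the authors' method: the abstract specifies neither the proposed metric nor the thresholds, so the per-term class weights, the margin-based stopping rule, and every name below (classify_with_threshold, EXAMPLE_WEIGHTS, threshold) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical per-term class weights; in practice these would come from
# a feature-scoring metric computed on training data (not given here).
EXAMPLE_WEIGHTS = {
    "goal": {"sport": 2.0},
    "match": {"sport": 1.5, "tech": 0.2},
    "chip": {"tech": 2.5},
}


def classify_with_threshold(tokens, term_weights, threshold):
    """Return (label, tokens_read).

    Tokens are read in order and their class weights accumulated.
    Reading stops as soon as the leading class beats the runner-up
    by at least `threshold` (an assumed stopping rule).
    """
    scores = defaultdict(float)
    for read, token in enumerate(tokens, start=1):
        # Unknown terms (for example, terms removed by feature
        # reduction) contribute nothing.
        for label, weight in term_weights.get(token, {}).items():
            scores[label] += weight
        if scores:
            ranked = sorted(scores.values(), reverse=True)
            runner_up = ranked[1] if len(ranked) > 1 else 0.0
            if ranked[0] - runner_up >= threshold:
                return max(scores, key=scores.get), read
    # Threshold never reached: fall back to the best score seen, if any.
    label = max(scores, key=scores.get) if scores else None
    return label, len(tokens)


if __name__ == "__main__":
    doc = ["the", "goal", "match", "chip", "chip"]
    print(classify_with_threshold(doc, EXAMPLE_WEIGHTS, threshold=3.0))
```

In this toy run the decision is reached after three of the five tokens, so the remaining two are never examined. A lower threshold stops earlier at the cost of less evidence, while a higher one reads more of the text before committing to a class.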






Cherif, W., Madani, A. & Kissi, M. Text categorization based on a new classification by thresholds. Prog Artif Intell 10, 433–447 (2021). https://doi.org/10.1007/s13748-021-00247-1
DOI: https://doi.org/10.1007/s13748-021-00247-1