Text categorization based on a new classification by thresholds

Cherif, Walid; Madani, Abdellah; Kissi, Mohamed

doi:10.1007/s13748-021-00247-1

Regular Paper
Published: 03 June 2021

Volume 10, pages 433–447, (2021)

Progress in Artificial Intelligence

Abstract

Automated text categorization attempts to provide an effective solution to today’s unprecedented growth of textual data. Because it can organize a huge and varied amount of text from which invaluable insights can be gained, it has become an emerging field of investigation for the research community. However, although several mathematical approaches have been studied to formalize the main components of a text categorization system (text representation, feature extraction, and the classification process), such systems still face many difficulties, due both to the complex nature of text databases and to the high dimensionality of text representations. This paper introduces an alternative way to address this problem. First, it reduces the original set of features by using a newly proposed metric. Second, the proposed approach automatically classifies a text without necessarily processing all of its features. Moreover, some standard preprocessing steps, such as stemming, can be abandoned with this approach. The experimental results showed that this new text categorization method outperforms the state-of-the-art methods: the obtained F-measures on the 20 Newsgroups, BBC News, Reuters, and AG News datasets were, respectively, 95.06%, 98.21%, 88.44%, and 95.70%, while standard approaches returned considerably lower scores.
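
For readers unfamiliar with the evaluation measure, the F-measure is conventionally the harmonic mean of precision and recall. The abstract does not state how per-class scores were averaged, so the following minimal sketch assumes macro-averaging; the function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of the per-class F1 scores.

    Assumption: macro-averaging is used here for illustration only;
    the paper's abstract does not specify the averaging scheme.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this class
        else:
            fp[pred] += 1       # predicted class gets a false positive
            fn[truth] += 1      # true class gets a false negative
    scores = []
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        # Harmonic mean of precision and recall; 0 when both are 0.
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: four documents, two categories.
y_true = ["sport", "sport", "tech", "tech"]
y_pred = ["sport", "tech", "tech", "tech"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

A micro-averaged or weighted variant would pool or weight the per-class counts instead, and can give noticeably different values on imbalanced datasets.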

Author information

Authors and Affiliations

Laboratory SI2M, Department of Computer Science, National Institute of Statistics and Applied Economics, Rabat-Institutes, B.P. 6217, Rabat, Morocco
Walid Cherif
Laboratory LAROSERI, Department of Computer Science, Faculty of Sciences, University Chouaib Doukkali, B.P. 20, 24000, El Jadida, Morocco
Abdellah Madani
Laboratory LIM, Department of Computer Science, Faculty of Sciences and Technology, University Hassan II Casablanca, B.P. 146, 20650, Mohammedia, Morocco
Mohamed Kissi

Corresponding author

Correspondence to Walid Cherif.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cherif, W., Madani, A. & Kissi, M. Text categorization based on a new classification by thresholds. Prog Artif Intell 10, 433–447 (2021). https://doi.org/10.1007/s13748-021-00247-1

Received: 26 November 2019
Accepted: 13 May 2021
Published: 03 June 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s13748-021-00247-1
