Text categorization based on a new classification by thresholds

  • Regular Paper
  • Progress in Artificial Intelligence

Abstract

Automated text categorization offers an effective response to today’s unprecedented growth of textual data. Because it can organize large and varied collections of texts from which valuable insights can be drawn, it has become an active field of research. Although several mathematical approaches have been studied to formalize the main components of a text categorization system (text representation, feature extraction, and the classification process), such systems still face many difficulties, owing both to the complex nature of text databases and to the high dimensionality of text representations. This paper introduces an alternative way to address the problem. First, it reduces the original set of features using a newly proposed metric. Second, it classifies a text automatically without necessarily processing all of its features. Moreover, some standard preprocessing steps, such as stemming, can be omitted with this approach. The experimental results show that the new text categorization method outperforms state-of-the-art methods: the F-measures obtained on the 20 Newsgroups, BBC News, Reuters, and AG News datasets were 95.06%, 98.21%, 88.44%, and 95.70%, respectively, while standard approaches returned considerably lower scores.
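The idea of classifying a text without reading all of its features can be illustrated with a small sketch. This is not the method of the paper, whose metric and thresholds are not given in the abstract; the function name `classify_with_threshold`, the per-class word weights, and the margin-based stopping rule are all assumptions made for illustration only.

```python
from collections import defaultdict


def classify_with_threshold(tokens, weights, threshold):
    """Illustrative early-stopping classifier (hypothetical, not the paper's).

    weights: dict mapping a retained feature (word) to {class_label: weight}.
             Words absent from `weights` are treated as removed by feature
             selection and contribute nothing.
    threshold: required lead of the best class over the runner-up.
    Returns (predicted_label, number_of_tokens_read).
    """
    scores = defaultdict(float)
    read = 0
    for read, token in enumerate(tokens, start=1):
        # Add this token's evidence to every class it is associated with.
        for label, w in weights.get(token, {}).items():
            scores[label] += w
        ranked = sorted(scores.values(), reverse=True)
        if ranked:
            runner_up = ranked[1] if len(ranked) > 1 else 0.0
            # Stop as soon as the leading class is far enough ahead.
            if ranked[0] - runner_up >= threshold:
                break
    label = max(scores, key=scores.get) if scores else None
    return label, read


# Toy weights, invented for the example.
toy_weights = {
    "goal": {"sport": 2.0},
    "match": {"sport": 1.5, "politics": 0.2},
    "vote": {"politics": 2.0},
}
label, read = classify_with_threshold(
    ["the", "goal", "match", "vote", "vote"], toy_weights, threshold=3.0
)
print(label, read)
```

In this toy run the decision is reached after three of the five tokens, which is the kind of saving the abstract alludes to; how the actual approach sets its thresholds is described only in the full article.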

Author information

Corresponding author

Correspondence to Walid Cherif.

About this article

Cite this article

Cherif, W., Madani, A. & Kissi, M. Text categorization based on a new classification by thresholds. Prog Artif Intell 10, 433–447 (2021). https://doi.org/10.1007/s13748-021-00247-1

  • DOI: https://doi.org/10.1007/s13748-021-00247-1
