Combining Words and Concepts for Automatic Arabic Text Classification

  • Conference paper
Arabic Language Processing: From Theory to Practice (ICALP 2017)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 782))


Abstract

This paper examines the combination of words and concepts as a text representation for Arabic Automatic Text Classification (ATC), and its impact on classification accuracy when used with various stemming methods and classifiers. An experimental Arabic ATC system was developed, and the effects of its main components on classification accuracy were assessed. First, variants of the standard Bag-of-Words model with different stemming methods were examined and compared. Arabic Wikipedia and WordNet were then compared as sources of concepts for an effective Bag-of-Concepts representation. Based on this comparison, Wikipedia was used to provide concepts, and different strategies for combining words and concepts, including two new approaches developed in-house, were evaluated in terms of their impact on overall classification accuracy. The experimental results show that text representation is a key element in the performance of Arabic ATC, and that combining words and concepts to represent Arabic text improves classification accuracy compared with using words or concepts alone.
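As a rough illustration of what combining a Bag-of-Words with a Bag-of-Concepts can look like, the sketch below simply adds concept counts to word counts. It is not the paper's method: the abstract does not describe the combination strategies, so the concatenation shown here, the `CONCEPT_MAP` lookup, and all function names are assumptions made for illustration, and English tokens stand in for stemmed Arabic tokens.

```python
from collections import Counter

# Hypothetical word-to-concept lookup standing in for a Wikipedia- or
# WordNet-derived resource; the entries are illustrative, not from the paper.
CONCEPT_MAP = {
    "football": "Sport",
    "goalkeeper": "Sport",
    "bank": "Finance",
    "interest": "Finance",
}


def bag_of_words(tokens):
    """Count each (already stemmed) token, prefixed to mark it as a word."""
    return Counter(f"w:{t}" for t in tokens)


def bag_of_concepts(tokens):
    """Count the concept of every token that maps to one."""
    return Counter(f"c:{CONCEPT_MAP[t]}" for t in tokens if t in CONCEPT_MAP)


def combined_features(tokens):
    """Add concept features alongside word features (plain concatenation)."""
    return bag_of_words(tokens) + bag_of_concepts(tokens)


tokens = ["bank", "raised", "interest", "rates"]
features = combined_features(tokens)
```

The `w:` and `c:` prefixes keep the two feature spaces apart, so a word and a concept with the same surface form do not collide. Other strategies, such as replacing matched words with their concepts, would change only `combined_features`.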

Author information

Corresponding author

Correspondence to Abdulhussain E. Mahdi.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Alahmadi, A., Joorabchi, A., Mahdi, A.E. (2018). Combining Words and Concepts for Automatic Arabic Text Classification. In: Lachkar, A., Bouzoubaa, K., Mazroui, A., Hamdani, A., Lekhouaja, A. (eds) Arabic Language Processing: From Theory to Practice. ICALP 2017. Communications in Computer and Information Science, vol 782. Springer, Cham. https://doi.org/10.1007/978-3-319-73500-9_8

  • DOI: https://doi.org/10.1007/978-3-319-73500-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73499-6

  • Online ISBN: 978-3-319-73500-9

  • eBook Packages: Computer Science (R0)
