Study for Automatic Classification of Arabic Spoken Documents

  • Conference paper
Computational Collective Intelligence (ICCCI 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10449)

Abstract

Classifying speech by domain is an important task in natural language processing. According to the literature, no prior study has addressed this problem, and in particular the effect of root N-grams and stem N-grams on Arabic speech classification performance. In this paper we describe a study of Arabic spoken document classification using the K-Nearest Neighbor, Naive Bayes and Support Vector Machine classifiers. We build a speech recognition system to transcribe Arabic audio files. We then use four types of features: 1-gram, 2-gram and 3-gram word roots or stems, as well as full words. The results show that, compared to stem or word N-grams, using 1-gram roots as features gives better performance for Arabic speech classification. Classification performance decreases as N increases. The data also show that the support vector machine outperforms Naive Bayes and the k-nearest neighbor with 1-grams. When the k-nearest neighbor is used, the 2-gram root achieves the best performance. The 3-gram root, on the other hand, achieves the best performance when the support vector machine is used.
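The feature scheme described in the abstract can be illustrated with a minimal sketch. It assumes the transcripts have already been reduced to root tokens; the root strings, the two domain labels, and the helper names (`ngrams`, `vectorize`, `cosine`, `knn_predict`) are hypothetical placeholders, not the authors' data or implementation. A plain cosine-similarity k-nearest-neighbor vote stands in for the classifiers named in the abstract.

```python
from collections import Counter
from math import sqrt

def ngrams(tokens, n):
    """Return contiguous n-grams over a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vectorize(tokens, n):
    """Bag-of-n-grams term-frequency vector."""
    return Counter(ngrams(tokens, n))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_predict(train, query_tokens, n=1, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query_vec = vectorize(query_tokens, n)
    ranked = sorted(train, key=lambda doc: cosine(vectorize(doc[0], n), query_vec),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus: made-up Latin strings standing in for Arabic roots.
train = [
    (["lEb", "krh", "fwz", "frq"], "sport"),
    (["frq", "lEb", "hdf", "fwz"], "sport"),
    (["swq", "mAl", "bnk", "sEr"], "economy"),
    (["bnk", "mAl", "rbH", "swq"], "economy"),
]
query = ["frq", "fwz", "lEb"]

print(knn_predict(train, query, n=1, k=3))
print(ngrams(query, 2))
```

On this toy data the 1-gram vectors of the query overlap with both "sport" documents, while its 2-grams match no training document at all. That sparsity of higher-order N-grams is one common explanation for weaker results at larger N; the abstract itself does not state a cause.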

Author information

Corresponding author

Correspondence to Mohamed Labidi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Labidi, M., Maraoui, M., Zrigui, M. (2017). Study for Automatic Classification of Arabic Spoken Documents. In: Nguyen, N., Papadopoulos, G., Jędrzejowicz, P., Trawiński, B., Vossen, G. (eds) Computational Collective Intelligence. ICCCI 2017. Lecture Notes in Computer Science, vol 10449. Springer, Cham. https://doi.org/10.1007/978-3-319-67077-5_44

  • DOI: https://doi.org/10.1007/978-3-319-67077-5_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67076-8

  • Online ISBN: 978-3-319-67077-5

  • eBook Packages: Computer Science, Computer Science (R0)
