Study for Automatic Classification of Arabic Spoken Documents

Labidi, Mohamed; Maraoui, Mohsen; Zrigui, Mounir

doi:10.1007/978-3-319-67077-5_44

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10449)

Included in the following conference series:

International Conference on Computational Collective Intelligence

Abstract

Classifying speech by domain is an important task in natural language processing. According to the literature, no prior study has addressed this problem, in particular the effect of using root N-grams and stem N-grams on Arabic speech classification performance. In this paper we describe a study of Arabic spoken document classification using K-Nearest Neighbor, Naive Bayes and Support Vector Machine classifiers. We build a speech recognition system to transcribe Arabic audio files. We then use four types of features: 1-gram, 2-gram and 3-gram word roots or stems, as well as full words. The results show that, compared with stem or word N-grams, using 1-gram roots as features gives better performance for Arabic speech classification. Classification performance decreases as N increases. The results also show that the support vector machine outperforms Naive Bayes and k-nearest neighbor with 1-grams. When the k-nearest neighbor classifier is used, the 2-gram root achieves the best performance. The 3-gram root, on the other hand, achieves the best performance when the support vector machine is used.
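To make the feature types in the abstract concrete, the following is a minimal Python sketch of N-gram extraction over already-transcribed, already-rooted tokens, paired with a small multinomial Naive Bayes classifier. It is an illustration under stated assumptions, not the authors' implementation: the transliterated root tokens, the two domain labels, the add-one smoothing and the names `ngrams` and `NaiveBayes` are all hypothetical, and the speech recognition and root extraction steps are assumed to have run beforehand.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def __init__(self, n=1):
        self.n = n  # order of the n-gram features (1, 2 or 3 in the paper)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = {c: Counter() for c in self.class_counts}
        for tokens, label in zip(docs, labels):
            self.feature_counts[label].update(ngrams(tokens, self.n))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        # Iterate in sorted order so that ties are broken deterministically.
        for c in sorted(self.class_counts):
            counts = self.feature_counts[c]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[c] / total_docs)
            for g in ngrams(tokens, self.n):
                if g in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((counts[g] + 1) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Hypothetical toy corpus: each document is a list of transliterated roots.
train_docs = [
    ["lab", "krh", "fwz"],
    ["fwz", "frq", "lab"],
    ["mal", "swq", "bya"],
    ["swq", "mal", "rbh"],
]
train_labels = ["sport", "sport", "economy", "economy"]

unigram_model = NaiveBayes(n=1).fit(train_docs, train_labels)
bigram_model = NaiveBayes(n=2).fit(train_docs, train_labels)

print(unigram_model.predict(["krh", "lab"]))
print(bigram_model.predict(["swq", "mal"]))
```

Changing `n` switches between 1-gram, 2-gram and 3-gram features, and feeding stems or full words instead of roots only changes the input tokens. The K-Nearest Neighbor and Support Vector Machine classifiers named in the abstract would consume the same N-gram counts.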


Author information

Authors and Affiliations

Research Laboratory of Technologies of Information and Communication and Electrical Engineering, Tunis, Tunisia
Mohamed Labidi & Mounir Zrigui
Computational Mathematics Laboratory, Monastir, Tunisia
Mohsen Maraoui

Corresponding author

Correspondence to Mohamed Labidi.

Editor information

Editors and Affiliations

Department of Information Systems, Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Department of Computer Science, University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos
Department of Information Systems, Gdynia Maritime University, Gdynia, Poland
Piotr Jędrzejowicz
Department of Information Systems, Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński
Department of Information Systems, University of Münster, Münster, Germany
Gottfried Vossen

About this paper

Cite this paper

Labidi, M., Maraoui, M., Zrigui, M. (2017). Study for Automatic Classification of Arabic Spoken Documents. In: Nguyen, N., Papadopoulos, G., Jędrzejowicz, P., Trawiński, B., Vossen, G. (eds) Computational Collective Intelligence. ICCCI 2017. Lecture Notes in Computer Science, vol 10449. Springer, Cham. https://doi.org/10.1007/978-3-319-67077-5_44

DOI: https://doi.org/10.1007/978-3-319-67077-5_44
Published: 07 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67076-8
Online ISBN: 978-3-319-67077-5
eBook Packages: Computer Science, Computer Science (R0)
