Abstract
One important task in natural language processing is the classification of speech by domain. A review of the literature shows that no prior study has addressed this problem, in particular the effect of using root N-grams and stem N-grams on Arabic speech classification performance. In this paper we describe a study of Arabic spoken document classification using the K-Nearest Neighbor, Naive Bayes and Support Vector Machine classifiers. We build a speech recognition system to transcribe Arabic audio files. We then use four types of features: 1-gram, 2-gram and 3-gram word roots or stems, as well as full words. The results show that, compared with stem or word N-grams, using 1-gram roots as features gives better performance for Arabic speech classification. Classification performance decreases as the N-gram order increases. The results also show that the support vector machine outperforms Naive Bayes and the k-nearest neighbor with 1-grams. When the k-nearest neighbor is used, the 2-gram root achieves the best performance. The 3-gram root, on the other hand, achieves the best performance when the support vector machine is used.
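As an illustration of the feature types named above, the following is a minimal sketch of N-gram extraction over a sequence of root (or stem) tokens, followed by a toy k-nearest-neighbor vote using cosine similarity. It is not the authors' implementation: the function names, the cosine measure, and the placeholder tokens `r1` to `r6` are assumptions made for the example, and the sketch presumes that transcription and root or stem extraction have already been done.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(tokens, n):
    """Bag-of-n-grams: map each n-gram to its count in the document."""
    return Counter(ngrams(tokens, n))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query_tokens, n=1, k=1):
    """Label a document by majority vote among its k most similar training documents."""
    query = featurize(query_tokens, n)
    scored = sorted(
        ((cosine(featurize(tokens, n), query), label) for tokens, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: placeholder root tokens and domain labels,
# not taken from the paper's corpus.
train = [
    (["r1", "r2", "r3", "r2"], "sport"),
    (["r4", "r5", "r6", "r5"], "economy"),
]
```

Calling `knn_predict(train, tokens, n=2)` switches the same pipeline from 1-gram to 2-gram features, which is the kind of comparison the study reports across N-gram orders.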
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Labidi, M., Maraoui, M., Zrigui, M. (2017). Study for Automatic Classification of Arabic Spoken Documents. In: Nguyen, N., Papadopoulos, G., Jędrzejowicz, P., Trawiński, B., Vossen, G. (eds) Computational Collective Intelligence. ICCCI 2017. Lecture Notes in Computer Science, vol 10449. Springer, Cham. https://doi.org/10.1007/978-3-319-67077-5_44
DOI: https://doi.org/10.1007/978-3-319-67077-5_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67076-8
Online ISBN: 978-3-319-67077-5
eBook Packages: Computer Science, Computer Science (R0)