Abstract
Arabic is one of the six official languages of the United Nations. Arabic language processing, and speech synthesis in particular, is challenging because of the inherent complexity of Arabic text and characters and because each letter may have up to seven different sounds. In this paper, we provide subjective and objective evaluations of six Arabic speech synthesizer applications available on the Internet: Acapela, iSpeech, Arabi, Sakhr, Google, and Nuance. For the subjective evaluation, we performed four intelligibility tests: Diagnostic Rhyme, Modified Rhyme, Phonetically Confusable Sentences, and Automatic Diacritization Intelligibility (ADI). The ADI test, which we propose, measures how well a speech engine predicts diacritization marks from the context of a word in the sentence. Two further tests evaluate other features of the speech engines. The first, Arabic Text with All Sounds (ATAS), evaluates different features when the speech engine reads Arabic text that contains all the sounds of the different Arabic letters. The second, Best/Worst Pleasant Voice, which we also propose, identifies the best and worst speech engines in terms of voice pleasantness. For the objective evaluation, we assess the output of the six systems and compare the results with those of the subjective evaluation. The comparison is made by computing objective metrics from the signal generated by each system and from a reference signal (i.e., the same text spoken by a human). Two objective metrics are used: a segmental signal-to-noise ratio (segmented SNR) and a linear predictive (LP-based) measure.
The originality of the evaluation lies in its use of Arabic text (diacritized and non-diacritized) containing all the sounds of the Arabic letters. Another novelty is the introduction of two tests, ADI and ATAS, for evaluating Arabic speech synthesizers. Results from subject users are provided for clearness/naturalness, speed, sound quality, pronunciation, clearness, stress/intonation, pronunciation errors, intelligibility, and pleasantness. In addition, results from experts are presented for the articulation of each sound, the number of unpronounced words, and the reading speed. The results reveal the need for Arabic speech synthesizers that take diacritization into account to improve system performance. They also point to the importance of an accurate automatic diacritization system that generates diacritized text to be synthesized, and they show the significance of a human-like voice for the speech synthesizer. We propose a set of recommendations for improving Arabic speech synthesizers.
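To make the objective comparison concrete, the following is a minimal sketch of a segmental SNR computed between a human reference signal and a synthesized signal. It is an illustration under stated assumptions, not the authors' implementation: the function name `segmental_snr`, the frame length, the clamping limits of -10 dB and 35 dB, and the requirement that the two signals are already time-aligned and sampled at the same rate are all assumptions that the abstract does not specify.

```python
import math


def segmental_snr(reference, synthesized, frame_len=256,
                  floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR in dB between two time-aligned signals.

    Assumptions (not taken from the paper): non-overlapping frames of
    `frame_len` samples, and per-frame values clamped to
    [floor_db, ceil_db] so that a few extreme frames do not dominate.
    """
    n = min(len(reference), len(synthesized))
    if n < frame_len:
        raise ValueError("signals are shorter than one frame")

    frame_snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        syn = synthesized[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        # The "noise" is the sample-wise difference from the reference.
        noise_energy = sum((r - s) ** 2 for r, s in zip(ref, syn))
        if signal_energy == 0.0:
            continue  # skip frames where the reference is silent
        if noise_energy == 0.0:
            snr = ceil_db  # identical frame: use the upper clamp
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        frame_snrs.append(max(floor_db, min(ceil_db, snr)))

    if not frame_snrs:
        raise ValueError("no frames with reference energy")
    return sum(frame_snrs) / len(frame_snrs)


if __name__ == "__main__":
    # Toy signals standing in for real recordings.
    human = [math.sin(0.1 * i) for i in range(1024)]
    engine = [0.9 * x for x in human]
    print(round(segmental_snr(human, engine), 2))
```

Higher values indicate that the synthesized signal is closer to the human reference. In practice the two recordings differ in duration and timing, so some alignment step would be needed before a frame-by-frame comparison of this kind is meaningful.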
Acknowledgments
This research project is funded by the Jordanian Scientific Research Support Fund, No. EIT/1/05/2011. We thank Prof. Sameer Istetiah of the Arabic department at Yarmouk University, a well-known expert in the Arabic language, who provided us with this text.
Appendix: Questionnaire for testing speech synthesizer application
Clearness/naturalness
Q1) Is the voice pleasant to listen to?
1. Very natural
2. Natural
3. OK
4. Unnatural
5. Very unnatural
Speed
Q2) Does the system speak at an adequate speed?
1. Much too fast
2. Too fast
3. Fast/normal
4. Too slow
5. Much too slow
Sound quality
Q3) Do you consider the system to have good sound quality?
1. Very bad
2. Bad
3. Neutral
4. Good
5. Very good
Pronunciation
Q4) How easy was it to understand the words?
1. Very hard
2. Hard
3. Neutral
4. Easy
5. Very easy
Q5) Did you have to concentrate a lot to understand what the voice said?
1. Needs a lot of attention
2. Some attention at some words
3. Normal attention
4. Little attention
5. No attention was needed
Q6) How did you find the pronunciation?
1. Extremely annoying
2. Very annoying
3. Annoying
4. A little annoying
5. Not annoying
Clearness
Q7) How clear is the voice?
1. Very little
2. Little
3. Neutral
4. Much
5. Very much
Q8) Was the voice easy to understand?
1. Very hard
2. Hard
3. Neutral
4. Easy
5. Very easy
Stress/intonation
Q9) What do you think of the intonation of the voice?
1. Very bad
2. Bad
3. Neutral
4. Good
5. Very good
Q10) How did you find the stress?
1. Extremely annoying
2. Very annoying
3. Annoying
4. A little annoying
5. Not annoying
Finding errors
Q11) Does the system make many pronunciation mistakes?
1. Too many
2. Many
3. Neutral
4. Few
5. Very few
Cite this article
Abu Doush, I., Alkhatib, F. & Bsoul, A.A.R. What we have and what is needed, how to evaluate Arabic Speech Synthesizer? Int J Speech Technol 19, 415–432 (2016). https://doi.org/10.1007/s10772-015-9304-6
DOI: https://doi.org/10.1007/s10772-015-9304-6