Abstract
Since it is impractical to prerecord human speech for dynamic content such as email messages and news, many commercial speech applications use recorded human speech for fixed content (e.g., system prompts) and synthetic speech for dynamic content. However, mixing human speech and synthetic speech may not be optimal from a consistency perspective. A two-condition between-participants experiment (N = 24) compared two versions of a telephony application for Personal Information Management (PIM). In the first condition, all system output was delivered in synthetic speech; in the second, users heard a mix of human speech and synthetic speech. Users completed several email and calendar tasks. Their task performance was rated by two independent judges, and their self-ratings of task performance and attitudinal responses were measured by questionnaires. Users of the all-synthetic-speech interface performed the tasks significantly better, whereas users of the mixed-speech interface believed they had performed better and reported more positive attitudinal responses. A consistency framework drawn from human psychological processing is offered to explain the difference in task performance, and cognitive processing is differentiated from attitudinal response. Design implications and directions for future research are suggested.
Cite this article
Gong, L., Lai, J. To Mix or Not to Mix Synthetic Speech and Human Speech? Contrasting Impact on Judge-Rated Task Performance versus Self-Rated Performance and Attitudinal Responses. International Journal of Speech Technology 6, 123–131 (2003). https://doi.org/10.1023/A:1022382413579