Expanding the MOS: Development and Psychometric Evaluation of the MOS-R and MOS-X

International Journal of Speech Technology

Abstract

The Mean Opinion Scale (MOS) is a questionnaire used to obtain listeners' subjective assessments of synthetic speech. This paper documents the motivation, method, and results of six experiments, conducted from 1999 to 2002, that investigated the psychometric properties of the MOS and expanded the range of speech characteristics it evaluates. Our initial experiments documented the reliability, validity, sensitivity, and factor structure of the MOS described by Salza et al. (1996, Acta Acustica, 82:650–656) and used psychometric principles to revise and improve the scale. This work resulted in the MOS-Revised (MOS-R). Four subsequent experiments expanded the MOS-R beyond its previous focus on Intelligibility and Naturalness to include measurement of the Prosody and Social Impression of synthetic voices. As a result of this work, we created the MOS-Expanded (MOS-X), a rating scale shown to be reliable, valid, and sensitive for high-quality evaluation of synthetic speech in applied industrial settings.

References

  • Baken, R. (1978). Clinical Measurement of Speech and Voice. Boston: Allyn & Bacon.

  • Berry, D. (1992). Vocal types and stereotypes: Joint effects of vocal attractiveness and vocal maturity on person perception. Journal of Nonverbal Behavior, 16:41-45.

  • Bloom, K., Zajac, D., and Titus, J. (1999). The influence of nasality of voice on sex-stereotyped perceptions. Journal of Nonverbal Behavior, 23:271-281.

  • Bradlow, A., Torretta, G., and Pisoni, D. (1996). Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20:255-272.

  • Brown, B., Strong, W., and Rencher, A. (1973). Perceptions of personality from speech: Effects of manipulations of acoustical parameters. Journal of the Acoustical Society of America, 54:29-35.

  • Brown, B., Strong, W., and Rencher, A. (1975). Acoustic determinants of perceptions of personality from speech. International Journal of the Sociology of Language, 6:1-32.

  • Cliff, N. (1987). Analyzing Multivariate Data. San Diego, CA: Harcourt Brace Jovanovich.

  • Coovert, M.D. and McNelis, K. (1988). Determining the number of common factors in factor analysis: A review and program. Educational and Psychological Measurement, 48:687-693.

  • Ekman, P., O'Sullivan, M., Friesen, W., and Scherer, K. (1991). Face, voice, and body in detecting deceit. Journal of Nonverbal Behavior, 15:125-135.

  • Francis, A.L. and Nusbaum, H.C. (1999). Evaluating the quality of synthetic speech. In D. Gardner-Bonneau (Ed.), Human Factors and Voice Interactive Systems. Boston, MA: Kluwer, pp. 63-97.

  • Goldstein, M. (1995). Classification of methods used for assessment of text-to-speech systems according to the demands placed on the listener. Speech Communication, 16:225-244.

  • Gorenflo, D. and Gorenflo, C. (1997). Effects of synthetic speech, gender, and perceived similarity on attitudes toward the augmented communicator. AAC: Augmentative and Alternative Communication, 13:87-91.

  • Granstrom, B. and Nord, L. (1992). Neglected dimensions in speech synthesis. Speech Communication, 11:459-462.

  • Greene, B., Logan, J., and Pisoni, D. (1986). Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems. Behavior Research Methods, Instruments, and Computers, 18:100-107.

  • Hieda, I. and Kuchinomachi, Y. (1997). Preliminary study of relations between physical characteristics and psychological impressions of natural voices. Perceptual and Motor Skills, 85:1483-1491.

  • Higashikawa, M. and Minifie, F. (1999). Acoustical-perceptual correlates of 'whisper pitch' in synthetically generated vowels. Journal of Speech, Language, and Hearing Research, 42:583-591.

  • Hillenbrand, J. (1988). Perception of aperiodicities in synthetically generated voices. Journal of the Acoustical Society of America, 83:2361-2371.

  • Hoag, L. and Bedrosian, J. (1992). Effects of speech output type, message length, and reauditorization on perceptions of the communicative competence of an adult AAC user. Journal of Speech and Hearing Research, 35:1363-1366.

  • Holtgraves, T. and Lasky, B. (1999). Linguistic power and persuasion. Journal of Language and Social Psychology, 18:196-205.

  • Hosman, L. (1989). The evaluative consequences of hedges, hesitations, and intensifiers: Powerful and powerless speech styles. Human Communication Research, 15:383-406.

  • International Telecommunication Union (1994). A Method for Subjective Performance Assessment of the Quality of Speech Voice Output Devices (ITU-T Recommendation P.85). Geneva, Switzerland: ITU.

  • Johnston, R.D. (1996). Beyond intelligibility: The performance of text-to-speech synthesisers. BT Technology Journal, 14:100-111.

  • Johnson, W., Emde, R., Scherer, K., and Klinnert, M. (1986). Recognition of emotion from vocal cues. Archives of General Psychiatry, 43:280-283.

  • Klatt, D. and Klatt, L. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87:820-857.

  • Koopmans-Van Beinum, F. (1992). The role of focus words in natural and in synthetic continuous speech: Acoustic aspects. Speech Communication, 11:439-452.

  • Kraft, V. and Portele, T. (1995). Quality evaluation of five German speech synthesis systems. Acta Acustica, 3:351-365.

  • Landauer, T.K. (1988). Research methods in human-computer interaction. In M. Helander (Ed.), Handbook of Human-Computer Interaction. New York: Elsevier.

  • Lavner, Y., Gath, I., and Rosenhouse, J. (2000). The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels. Speech Communication, 30:9-26.

  • Lewis, J.R. (1993). Multipoint scales: Mean and median differences and observed significance levels. International Journal of Human-Computer Interaction, 5:383-392.

  • Lewis, J.R. (2001a). Psychometric properties of the Mean Opinion Scale. In Proceedings of HCI International 2001: Usability Evaluation and Interface Design. Mahwah, NJ: Lawrence Erlbaum, pp. 149-153.

  • Lewis, J.R. (2001b). The Revised Mean Opinion Scale (MOS-R): Preliminary Psychometric Evaluation (Tech. Report 29.3414). Raleigh, NC: International Business Machines Corp.

  • Martin, R. and Haroldson, S. (1992). Stuttering and speech naturalness: Audio and audiovisual judgments. Journal of Speech and Hearing Research, 35:521-528.

  • Massaro, D. and Egan, P. (1996). Perceiving affect from the voice and the face. Psychonomic Bulletin & Review, 3:215-221.

  • Miyake, K. and Zuckerman, M. (1993). Beyond personality: Effects of physical and vocal attractiveness on false consensus, social comparison, affiliation, and assumed and perceived personality. Journal of Personality, 61:411-437.

  • Moller, S., Jekosch, U., Mersdorf, J., and Kraft, V. (2001). Auditory assessment of synthesized speech in application scenarios: Two case studies. Speech Communication, 34:229-246.

  • Munsterburg, H. (1913). Psychology and industrial efficiency. In L.T. Benjamin Jr. (Ed.), A History of Psychology: Original Sources and Contemporary Research, 2nd edn. Boston: McGraw-Hill, pp. 584-593.

  • Murray, I. and Arnott, J. (1993). Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. Journal of the Acoustical Society of America, 93:1097-1108.

  • Murray, I. and Arnott, J. (1995). Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16:369-390.

  • Murray, I., Arnott, J., and Rohwer, E. (1996). Emotional stress in synthetic speech: Progress and future directions. Speech Communication, 20:85-91.

  • Nunnally, J.C. (1978). Psychometric Theory. New York: McGraw-Hill.

  • Paddock, J. and Nowicki, S. (1986). Paralanguage and the interpersonal impact of dysphoria: It's not what you say but how you say it. Social Behavior and Personality, 14:29-44.

  • Page, R. and Balloun, J. (1978). The effect of voice volume on the perception of personality. Journal of Social Psychology, 105:65-72.

  • Paris, C.R., Thomas, M.H., Gilson, R.D., and Kincaid, J.P. (2000). Linguistic cues and memory for synthetic and natural speech. Human Factors, 42:421-431.

  • Pelachaud, C., Badler, N., and Steedman, M. (1996). Generating facial expressions for speech. Cognitive Science, 20:1-46.

  • Pisoni, D. (1997). Perception of synthetic speech. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg (Eds.), Progress in Speech Synthesis. New York: Springer, pp. 541-560.

  • Pols, L. and Jekosch, U. (1997). A structured way of looking at the performance of text-to-speech systems. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg (Eds.), Progress in Speech Synthesis. New York: Springer, pp. 519-528.

  • Portele, T. and Heuft, B. (1997). Toward a prominence-based synthesis system. Speech Communication, 21:61-72.

  • Salza, P.L., Foti, E., Nebbia, L., and Oreglia, M. (1996). MOS and pair comparison combined methods for quality evaluation of text to speech systems. Acta Acustica, 82:650-656.

  • Schmidt-Nielsen, A. (1995). Intelligibility and acceptability testing for speech technology. In A. Syrdal, R. Bennett, and S. Greenspan (Eds.), Applied Speech Technology. Boca Raton: CRC Press.

  • Shipley, K. and McAfee, J. (1992). Assessment in Speech Language Pathology: A Resource Manual. San Diego: Singular.

  • Sonntag, G.P. and Portele, T. (1998). PURR: A method for prosody evaluation and investigation. Computer Speech and Language, 12:437-451.

  • Sonntag, G.P., Portele, T., Haas, F., and Kohler, J. (1999). Comparative evaluation of six German TTS systems. Eurospeech '99. Budapest: Technical University of Budapest, pp. 251-254.

  • Slowiaczek, L. and Nusbaum, H. (1985). Effects of speech rate and pitch contour on the perception of synthetic speech. Human Factors, 27:701-712.

  • Stern, S., Mullennix, J., Dyson, C., and Wilson, S. (1999). The persuasiveness of synthetic speech versus human speech. Human Factors, 41:588-595.

  • Tartter, V. and Braun, D. (1994). Hearing smiles and frowns in normal and whisper registers. Journal of the Acoustical Society of America, 96:2101-2107.

  • van Bezooijen, R. and van Heuven, V. (1997). Assessment of synthesis systems. In D. Gibbon, R. Moore, and R. Winski (Eds.), Handbook of Standards and Resources for Spoken Language Systems. New York, NY: Mouton de Gruyter.

  • Wang, H. and Lewis, J.R. (2001). Intelligibility and acceptability of short phrases generated by embedded text-to-speech engines. In Proceedings of HCI International 2001: Usability Evaluation and Interface Design. Mahwah, NJ: Lawrence Erlbaum, pp. 144-148.

  • Whalen, D. and Hoequist, C. (1995). The effects of breath sounds on the perception of synthetic speech. Journal of the Acoustical Society of America, 97:3147-3153.

  • Whitmore, J. and Fisher, S. (1996). Speech during sustained operations. Speech Communication, 20:55-70.

  • Yabuoka, H., Nakayama, T., Kitabayashi, Y., and Asakawa, Y. (2000). Investigations of independence of distortion scales in objective evaluation of synthesized speech quality. Electronics and Communications in Japan, Part 3, 83:14-22.

  • van Riper, C. and Emerick, L. (1990). Speech Correction. Englewood Cliffs, NJ: Prentice Hall.

  • Yaeger-Dror, M. (1996). Register as a variable in prosodic analysis: The case of the English negative. Speech Communication, 19:39-60.

  • Zuckerman, M., Miyake, K., and Hodgins, H. (1991). Cross-channel effects of vocal and physical attractiveness and their implications for interpersonal perception. Journal of Personality and Social Psychology, 60:545-554.

Cite this article

Polkosky, M.D., Lewis, J.R. Expanding the MOS: Development and Psychometric Evaluation of the MOS-R and MOS-X. International Journal of Speech Technology 6, 161–182 (2003). https://doi.org/10.1023/A:1022390615396
