DOI: 10.1145/2736277.2741669

Automatic Online Evaluation of Intelligent Assistants

Published: 18 May 2015

ABSTRACT

Voice-activated intelligent assistants, such as Siri, Google Now, and Cortana, are prevalent on mobile devices. However, evaluating them is challenging because of the varied and evolving set of tasks they support, e.g., voice command, web search, and chat. Since each task may have its own procedure and a unique form of correct answer, evaluating each task individually is expensive. This paper is the first attempt to address this challenge. We develop consistent and automatic approaches that can evaluate different tasks in voice-activated intelligent assistants. We use implicit feedback from users to predict whether users are satisfied with the intelligent assistant as well as with its components, i.e., speech recognition and intent classification. Using this approach, we can potentially evaluate and compare different tasks within and across intelligent assistants according to the predicted user satisfaction rates. Our approach is characterized by an automatic scheme for categorizing user-system interaction into task-independent dialog actions, e.g., the user is commanding, selecting, or confirming an action. We use the action sequence in a session to predict user satisfaction and the quality of speech recognition and intent classification. We also incorporate other features to further improve our approach, including features derived from previous work on web search satisfaction prediction and features that exploit acoustic characteristics of voice requests. We evaluate our approach using data collected from a user study. Results show that our approach can accurately identify satisfactory and unsatisfactory sessions.
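
To make the core idea concrete, the sketch below illustrates one way the approach described in the abstract could be realized: each session is represented as a sequence of task-independent dialog actions, and n-gram features over that sequence feed a classifier that predicts session-level satisfaction. This is a minimal illustration, not the authors' implementation; the action labels, example sessions, satisfaction labels, and the choice of logistic regression are assumptions made for the sketch.

# Minimal sketch (not the authors' code): predict session-level satisfaction
# from a session's sequence of task-independent dialog actions.
# Action labels, sessions, and satisfaction labels below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each session is a space-separated sequence of dialog actions, e.g. the
# user commanding, selecting, or confirming, and the system responding.
sessions = [
    "user_command system_answer user_confirm",
    "user_command system_answer user_reformulate system_answer user_abandon",
    "user_select system_answer user_confirm",
    "user_command system_misrecognize user_repeat system_answer user_abandon",
]
satisfaction = [1, 0, 1, 0]  # 1 = satisfied, 0 = unsatisfied (illustrative labels)

# Unigram and bigram counts over the action sequence serve as features;
# a simple classifier stands in for whatever model the paper actually uses.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(sessions, satisfaction)

new_session = "user_command system_misrecognize user_repeat user_abandon"
print(model.predict([new_session]))  # predicted satisfaction for the new session

In the paper's setting, this action-sequence representation would be combined with additional features, such as those derived from prior work on web search satisfaction prediction and acoustic characteristics of the voice requests.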


Published in

WWW '15: Proceedings of the 24th International Conference on World Wide Web
May 2015, 1460 pages
ISBN: 9781450334693
Copyright © 2015 held by the International World Wide Web Conference Committee (IW3C2)

Publisher: International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland


Qualifiers: research-article

Acceptance Rates

WWW '15 paper acceptance rate: 131 of 929 submissions, 14%
Overall acceptance rate: 1,899 of 8,196 submissions, 23%
