DOI: 10.1145/2736277.2741669

Automatic Online Evaluation of Intelligent Assistants

Published: 18 May 2015

ABSTRACT

Voice-activated intelligent assistants, such as Siri, Google Now, and Cortana, are prevalent on mobile devices. However, evaluating them is challenging because of the varied and evolving set of tasks they support, e.g., voice command, web search, and chat. Since each task may have its own procedure and a unique form of correct answer, evaluating each task individually is expensive. This paper is the first attempt to address this challenge. We develop consistent and automatic approaches that can evaluate different tasks in voice-activated intelligent assistants. We use implicit feedback from users to predict whether users are satisfied with the intelligent assistant as well as with its components, i.e., speech recognition and intent classification. Using this approach, we can potentially evaluate and compare different tasks within and across intelligent assistants according to the predicted user satisfaction rates. Our approach is characterized by an automatic scheme for categorizing user-system interaction into task-independent dialog actions, e.g., the user is commanding, selecting, or confirming an action. We use the action sequence in a session to predict user satisfaction and the quality of speech recognition and intent classification. We also incorporate other features to further improve our approach, including features derived from previous work on web search satisfaction prediction and features that exploit acoustic characteristics of voice requests. We evaluate our approach using data collected from a user study. Results show that our approach can accurately identify satisfactory and unsatisfactory sessions.
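
To make the core idea concrete, the sketch below illustrates one way the approach described in the abstract could be realized: each session is represented as a sequence of task-independent dialog actions, and n-gram features over that sequence feed a classifier that predicts session-level satisfaction. This is a minimal illustration, not the authors' implementation; the action labels, example sessions, satisfaction labels, and the choice of logistic regression are assumptions made for the sketch.

# Minimal sketch (not the authors' code): predict session-level satisfaction
# from a session's sequence of task-independent dialog actions.
# Action labels, sessions, and satisfaction labels below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each session is a space-separated sequence of dialog actions, e.g. the
# user commanding, selecting, or confirming, and the system responding.
sessions = [
    "user_command system_answer user_confirm",
    "user_command system_answer user_reformulate system_answer user_abandon",
    "user_select system_answer user_confirm",
    "user_command system_misrecognize user_repeat system_answer user_abandon",
]
satisfaction = [1, 0, 1, 0]  # 1 = satisfied, 0 = unsatisfied (illustrative labels)

# Unigram and bigram counts over the action sequence serve as features;
# a simple classifier stands in for whatever model the paper actually uses.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(sessions, satisfaction)

new_session = "user_command system_misrecognize user_repeat user_abandon"
print(model.predict([new_session]))  # predicted satisfaction for the new session

In the paper's setting, this action-sequence representation would be combined with additional features, such as those derived from prior work on web search satisfaction prediction and acoustic characteristics of the voice requests.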


Published in

WWW '15: Proceedings of the 24th International Conference on World Wide Web
May 2015, 1460 pages
ISBN: 9781450334693
Copyright © 2015 held by the International World Wide Web Conference Committee (IW3C2)

Publisher: International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland


Qualifiers: research-article

Acceptance Rates

WWW '15 paper acceptance rate: 131 of 929 submissions, 14%
Overall acceptance rate: 1,899 of 8,196 submissions, 23%
