short-paper

Yelling at Your TV: An Analysis of Speech Recognition Errors and Subsequent User Behavior on Entertainment Systems

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 853 - 856

https://doi.org/10.1145/3331184.3331271

Published: 18 July 2019

Abstract

Millions of consumers issue voice queries through television-based entertainment systems such as the Comcast X1, the Amazon Fire TV, and Roku TV. Automatic speech recognition (ASR) systems are responsible for transcribing these voice queries into text to feed downstream natural language understanding modules. However, ASR is far from perfect, often producing incorrect transcriptions and forcing users to take corrective action. To better understand the impact of these errors on sessions, this paper characterizes speech recognition errors as well as subsequent user responses. We provide both quantitative and qualitative analyses, examining the acoustic as well as lexical attributes of the utterances. This work represents, to our knowledge, the first analysis of speech recognition errors from real users on a widely deployed entertainment system.

Index Terms

Yelling at Your TV: An Analysis of Speech Recognition Errors and Subsequent User Behavior on Entertainment Systems
1. Computing methodologies
  1. Artificial intelligence
    1. Distributed artificial intelligence
      1. Intelligent agents
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Published In

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2019

1512 pages

ISBN:9781450361729

DOI:10.1145/3331184

General Chairs:
Benjamin Piwowarski, CNRS - Sorbonne Universite, France
Max Chevalier, Universite de Toulouse, CNRS, France
Eric Gaussier, Universite Grenoble Alpes, CNRS, France

Program Chairs:
Yoelle Maarek, Amazon Research, Israel
Jian-Yun Nie, University of Montreal, Canada
Falk Scholer, RMIT University, Australia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Natural Sciences and Engineering Research Council of Canada

Conference

SIGIR '19

Sponsor:

SIGIR

SIGIR '19: The 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

July 21 - 25, 2019

Paris, France

Acceptance Rates

SIGIR'19 Paper Acceptance Rate 84 of 426 submissions, 20%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%
