DOI: 10.1145/3534678.3539030
research-article

DuIVA: An Intelligent Voice Assistant for Hands-free and Eyes-free Voice Interaction with the Baidu Maps App

Published: 14 August 2022

Abstract

Mobile map apps such as the Baidu Maps app have become ubiquitous and essential tools for finding optimal routes and getting turn-by-turn navigation while driving. However, interacting with such apps through the visual-manual modality while driving inevitably distracts the driver, because it forces the driver into conspicuous time-sharing and multitasking between driving and operating the app. In this paper, we present the efforts and findings of a four-year longitudinal study on designing and implementing DuIVA, an intelligent voice assistant (IVA) embedded in the Baidu Maps app for hands-free, eyes-free human-to-app interaction in a fully voice-controlled manner. Specifically, DuIVA enables users to control the functionalities of Baidu Maps (e.g., navigation and location search) through voice interaction rather than visual-manual interaction. This minimizes driver distraction and promotes safe driving by allowing the driver to keep "eyes on the road and hands on the wheel" while interacting with the app. DuIVA has been deployed in production at Baidu Maps since November 2017. It provides a better interaction modality and improves the accessibility and usability of the app by offering in-app voice activation, natural language queries, and multi-round dialogue. As of December 31, 2021, over 530 million users had used DuIVA, which demonstrates that DuIVA is an industrial-grade, production-proven solution for in-app intelligent voice assistants.

Supplemental Material

MP4 File
Interacting with mobile map apps such as the Baidu Maps app through the visual-manual modality while driving inevitably distracts the driver, because it forces the driver into conspicuous time-sharing and multitasking between driving and operating the app. Therefore, to minimize driver distraction and promote safe driving, it is important to let users interact with map apps in a completely hands-free and eyes-free manner. In this paper, we present DuIVA, an industrial-grade, production-proven solution for building an in-app intelligent voice assistant. DuIVA enables users to interact with map apps through voice in a completely hands-free and eyes-free manner. Experiments and analysis show that DuIVA greatly reduces the time and effort required for user-to-app interaction while driving. DuIVA has been deployed in production at Baidu Maps since November 2017, and over 530 million users had used it as of December 31, 2021.

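Cited By

  • (2025) Intelligent Mission Commander: A Novel Voice Interaction Framework for Air Confrontation Scenarios. Advances in Guidance, Navigation and Control, 12-23. DOI: 10.1007/978-981-96-2236-8_2. Online publication date: 4-Mar-2025.
  • (2024) Zero-configuration Alarms: Towards Reducing Distracting Smartphone Interactions while Driving. ACM Journal on Computing and Sustainable Societies. DOI: 10.1145/3675159. Online publication date: 11-Jul-2024.
  • (2024) AdaptiveVoice: Cognitively Adaptive Voice Interface for Driving Assistance. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-18. DOI: 10.1145/3613904.3642876. Online publication date: 11-May-2024.
  • (2024) Aircraft human-machine interaction assistant design: A novel multimodal data processing and application framework. IET Control Theory & Applications 18(18), 2742-2765. DOI: 10.1049/cth2.12754. Online publication date: 28-Oct-2024.
  • (2023) LittleMu: Deploying an Online Virtual Teaching Assistant via Heterogeneous Sources Integration and Chain of Teach Prompts. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 4843-4849. DOI: 10.1145/3583780.3615484. Online publication date: 21-Oct-2023.
  • (2023) Cyclists' Use of Technology While on Their Bike. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-15. DOI: 10.1145/3544548.3580971. Online publication date: 19-Apr-2023.
  • (2023) Matching Point of Interests and Travel Blog with Multi-view Information Fusion. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2149-2153. DOI: 10.1145/3539618.3592016. Online publication date: 19-Jul-2023.
  • (2022) ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3029-3039. DOI: 10.1145/3534678.3539021. Online publication date: 14-Aug-2022.
  • (2022) DuIVRS: A Telephonic Interactive Voice Response System for Large-Scale POI Attribute Acquisition at Baidu Maps. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 3182-3191. DOI: 10.1145/3511808.3557131. Online publication date: 17-Oct-2022.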

    Published In

    ACM Conferences
    KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2022
    5033 pages
    ISBN:9781450393850
    DOI:10.1145/3534678
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. baidu maps
    2. eyes-free
    3. hands-free
    4. intelligent voice assistant
    5. task-oriented dialogue
    6. user-to-app interaction
    7. voice interaction

    Conference

    KDD '22

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
