Research Article | Open Access

End-Users Know Best: Identifying Undesired Behavior of Alexa Skills Through User Review Analysis

Published: 09 September 2024

Abstract

The Amazon Alexa marketplace has grown rapidly in recent years as third-party developers create large amounts of content and publish it directly to the skills store. Despite this growth, several security and usability concerns have been reported that may not be identified during the vetting phase. User reviews, however, can offer valuable insights into the security and privacy, quality, and usability of skills. To better understand the effects of problematic skills on end-users, we introduce ReviewTracker, a tool that discerns and classifies semantically negative user reviews to identify likely malicious, policy-violating, or malfunctioning behavior in Alexa skills. ReviewTracker employs a pre-trained FastText classifier to identify different undesired skill behaviors. We collected over 700,000 user reviews spanning six years, more than 200,000 of which carry negative sentiment. ReviewTracker identified 17,820 reviews reporting violations of Alexa policy requirements across 2,813 skills, and 131,855 reviews highlighting different types of user frustration associated with 9,294 skills. In addition, we developed a dynamic skill testing framework using ChatGPT to conduct two distinct types of tests on Alexa skills: one using software-based simulated interaction to explore skills' actual behaviors, and another using real voice commands to understand the potential factors causing discrepancies between intended skill functionality and user experience. Based on the number of undesired-behavior reviews per skill, we tested the top identified problematic skills and detected more than 228 skills violating at least one policy requirement. Our results demonstrate that user reviews can serve as a valuable means of identifying undesired skill behaviors.
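To illustrate the review-classification step described in the abstract, the sketch below shows how a FastText model could be trained on labeled review text and then applied to new reviews. This is a minimal illustration under stated assumptions, not ReviewTracker's actual pipeline: the Python fasttext package, the label taxonomy (policy_violation, malfunction), the training-file path, and the hyperparameters are all hypothetical choices made for the example.

# Minimal sketch (assumptions: Python "fasttext" package; hypothetical labels,
# file path, and hyperparameters; not the authors' implementation).
import fasttext

# Training data in FastText's supervised format, one labeled review per line, e.g.:
#   __label__policy_violation this skill kept asking for my home address
#   __label__malfunction the skill stops responding after the first question
model = fasttext.train_supervised(
    input="reviews_train.txt",  # hypothetical path to labeled negative reviews
    lr=0.5,
    epoch=25,
    wordNgrams=2,               # bigrams help capture short complaint phrases
)

def classify_review(text: str):
    """Return the top predicted undesired-behavior label and its confidence."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

# Example: route a semantically negative review to a behavior category.
print(classify_review("This skill asked for my credit card number out of nowhere"))

In a pipeline like the one the abstract describes, such a classifier would be applied only to reviews already flagged as negative by sentiment analysis, and per-skill label counts would then surface the most problematic skills as candidates for dynamic testing.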



Published In

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 8, Issue 3
September 2024
1782 pages
EISSN: 2474-9567
DOI: 10.1145/3695755
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 September 2024
Published in IMWUT Volume 8, Issue 3
