Research Article

Machine Generation of Audio Description for Blind and Visually Impaired People

Published: 24 June 2023

Abstract

Automating the generation of audio description (AD) for blind and visually impaired (BVI) people is a difficult task that involves several challenges: identifying gaps in the dialogue; describing the essential visual elements; summarizing the descriptions and fitting them into the dialogue gaps; generating an AD narration track; and synchronizing it with the main soundtrack. In our previous work (Campos et al. [6]), we proposed a solution for automatic AD script generation, named CineAD, which uses the movie’s script as the basis for AD generation. This article extends that solution by classifying visual information from the video, both to complement the information extracted from the script and to reduce the dependency on it. To assess the viability of the proposed solution, we implemented a proof of concept and evaluated it with 11 blind users. The results showed that the solution generates a more succinct and objective AD while preserving a level of user understanding similar to our previous work. Thus, the solution can provide relevant information to blind users while using less video time for descriptions.
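
As an illustration of the first challenge listed above, the sketch below shows one way to locate dialogue gaps long enough to hold an AD narration, assuming the dialogue timing is available as an SRT subtitle file. The SRT input, the helper names, and the 2-second minimum gap are assumptions made for this example only; the paper does not publish CineAD’s implementation.

    # Minimal sketch (not CineAD's actual code): find silences between
    # consecutive subtitle cues that are long enough to hold an AD narration.
    import re
    from datetime import timedelta

    # An SRT timing line looks like: 00:01:02,500 --> 00:01:05,000
    CUE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")

    def parse_time(ts: str) -> timedelta:
        """Convert an SRT timestamp (HH:MM:SS,mmm) into a timedelta."""
        h, m, s = ts[:8].split(":")
        ms = ts[9:]
        return timedelta(hours=int(h), minutes=int(m),
                         seconds=int(s), milliseconds=int(ms))

    def dialogue_gaps(srt_text: str, min_gap_seconds: float = 2.0):
        """Return (start, end) pairs of pauses between consecutive cues
        lasting at least min_gap_seconds (an assumed threshold)."""
        cues = CUE.findall(srt_text)
        gaps = []
        for (_, prev_end), (next_start, _) in zip(cues, cues[1:]):
            start, end = parse_time(prev_end), parse_time(next_start)
            if (end - start).total_seconds() >= min_gap_seconds:
                gaps.append((start, end))
        return gaps

In the pipeline the abstract describes, each description extracted from the script (or from the video classifiers) would then be summarized to fit the duration of the chosen gap before being synthesized and mixed with the main soundtrack.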


REFERENCES

[1] American Council of the Blind (ACB). 2019. The Audio Description Project. Retrieved from https://www.acb.org/adp/ad.html.
[2] Bolaños Marc, Peris Álvaro, Casacuberta Francisco, Soler Sergi, and Radeva Petia. 2018. Egocentric video description based on temporally-linked sequences. J. Vis. Commun. Image Repres. 50 (2018), 205–216.
[3] Adelson J., Flaxman S., Briant P., Bottone M., Vos T., Naidoo K., Braithwaite T., Cicinelli M., Jonas J., and Bourne R. R. 2020. Global prevalence of blindness and distance and near vision impairment in 2020: Progress towards the Vision 2020 targets and what the future holds. Investig. Ophthalm. Vis. Sci. 61 (2020).
[4] Braun Sabine, Starr Kim, and Laaksonen Jorma. 2020. Comparing Human and Automated Approaches to Visual Storytelling. 159–196.
[5] Calvo-Salamanca S., Coca-Castro A. F., and Velandia-Vega J. A. 2016. Web prototype for creating descriptions and playing videos with audio description using a speech synthesizer. In Proceedings of the 8th Euro American Conference on Telematics and Information Systems (EATIS’16). 1–7.
[6] Campos Virginia P., Araújo Tiago M. U. de, Filho Guido L. de Souza, and Gonçalves Luiz M. G. 2020. CineAD: A system for automated audio description script generation for the visually impaired. Univ. Access Inf. Societ. 19 (2020), 99–111.
[7] Chapdelaine Claude and Gagnon Langis. 2009. Accessible videodescription on-demand. In Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’09). ACM, New York, NY, 221–222.
[8] Chen Tseng-Hung, Zeng Kuo-Hao, Hsu Wan-Ting, and Sun Min. 2017. Video captioning via sentence augmentation and spatio-temporal attention. In Computer Vision—ACCV 2016 Workshops, Chen Chu-Song, Lu Jiwen, and Ma Kai-Kuang (Eds.). Springer International Publishing, Cham, 269–286.
[9] Deng J., Dong W., Socher R., Li L.-J., Li K., and Fei-Fei L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition.
[10] Domingues Leonardo A., Campos Virgínia P., Araújo Tiago M. U., and Filho Guido L. de S. 2016. Accessibility in digital cinema: A proposal for generation and distribution of audio description. In Proceedings of the 22nd Brazilian Symposium on Multimedia and the Web (Webmedia’16). ACM, New York, NY, 119–126.
[11] Drossos Konstantinos, Adavanne Sharath, and Virtanen Tuomas. 2017. Automated audio captioning with recurrent neural networks. CoRR abs/1706.10006 (2017).
[12] Encelle Benoît, Beldame Magali Ollagnier, and Prié Yannick. 2013. Towards the usage of pauses in audio-described videos. In Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility (W4A’13). ACM, New York, NY.
[13] Fernández-Torné Anna. 2016. Audio description and technologies: Study on the semi-automatisation of the translation and voicing of audio descriptions. Ph.D. Dissertation. Universitat Autònoma de Barcelona, Spain.
[14] Firdus S., Ahmad W. F. W., and Janier J. B. 2012. Development of audio video describer using narration to visualize movie film for blind and visually impaired children. In Proceedings of the International Conference on Computer and Information Science (ICCIS’12). 1068–1072.
[15] Gagnon L., Foucher S., Heritier M., et al. 2009. Towards computer-vision software tools to increase production and accessibility of video description for people with vision loss. Univ. Access Inf. Soc. 8 (2009), 199–218.
[16] Hurtado C. J., Rodríguez A., and Seibel C. 2010. Un Corpus de Cine. Fundamentos Teóricos de la Audiodescripción (A Corpus of Cinema: Theoretical Foundations of Audio Description). Universidad de Granada, Proyecto Tracce, 13–56.
[17] Ichiki Manon, Shimizu Toshihiro, Imai Atsushi, Takagi Tohru, Iwabuchi Mamoru, Kurihara Kiyoshi, Miyazaki Taro, Kumano Tadashi, Kaneko Hiroyuki, Sato Shoei, Seiyama Nobumasa, Yamanouchi Yuko, and Sumiyoshi Hideki. 2018. Study on automated audio descriptions overlapping live television commentary. In Computers Helping People with Special Needs, Miesenberger Klaus and Kouroupetroglou Georgios (Eds.). Springer International Publishing, Cham, 220–224.
[18] Karkar Abdel Ghani, Puthren Mary, and Al-ma’adeed Somaya. 2018. A bilingual scene-to-speech mobile based application. In Proceedings of the International Conference on Computer and Applications (ICCA’18), Beirut, 1–240.
[19] Kobayashi Masatomo, Nagano Tohru, Fukuda Kentarou, and Takagi Hironobu. 2010. Describing online videos with text-to-speech narration. In Proceedings of the International Cross-Disciplinary Conference on Web Accessibility (W4A’10). ACM, New York, NY.
[20] Kobayashi Masatomo, O’Connell Trisha, Gould Bryan, Takagi Hironobu, and Asakawa Chieko. 2010. Are synthesized video descriptions acceptable? In Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’10). ACM, New York, NY, 163–170.
[21] Lakritz J. and Salway A. 2002. The Semi-automatic Generation of Audio Description from Screenplays. Technical Report CS-06-05. Dept. of Computing, University of Surrey.
[22] Lin T., Maire M., Belongie S., et al. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision.
[23] Liu An-An, Xu Ning, Wong Yongkang, Li Junnan, Su Yu-Ting, and Kankanhalli Mohan. 2017. Hierarchical & multimodal video captioning. Comput. Vis. Image Underst. 163, C (Oct. 2017), 113–125.
[24] Lopez Mariana, Kearney Gavin, and Hofstadter Krisztian. 2018. Audio description in the UK: What works, what doesn’t, and understanding the need for personalising access. Brit. J. Vis. Impair. 36 (Aug. 2018).
[25] Masse Mark. 2011. REST API Design Rulebook. O’Reilly Media, Sebastopol.
[26] Mazur Iwona. 2020. Audio description: Concepts, theories and research approaches. In The Palgrave Handbook of Audiovisual Translation and Media Accessibility, Ł. Bogucki and M. Deckert (Eds.). Palgrave Studies in Translating and Interpreting. Palgrave Macmillan, Cham.
[27] Mazur Iwona. 2020. A functional approach to audio description. J. Audiovis. Transl. 3, 2 (Dec. 2020), 226–245.
[28] Nguyen Khoa, Drossos Konstantinos, and Virtanen Tuomas. 2020. Temporal sub-sampling of audio feature sequences for automated audio captioning. arXiv preprint arXiv:2007.02676.
[29] Nunes E. V., Machado F. O., and Vanzin T. 2011. Audiodescrição como Tecnologia Assistiva para o Acesso ao Conhecimento por Pessoas Cegas (Audio Description as Assistive Technology for Access to Knowledge for the Blind). Pandion, Florianópolis, 191–232.
[30] Oliveira Rita, Abreu Jorge Ferraz de, Almeida Margarida, and Cardoso Bernardo. 2016. Inclusive approaches for audiovisual translation production in interactive television (iTV). In Proceedings of the 7th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion (DSAI’16). ACM, New York, NY, 146–153.
[31] Perera M., Farook C., and Madurapperuma A. P. 2017. Automatic video descriptor for human action recognition. In Proceedings of the National Information Technology Conference (NITC’17). 61–67.
[32] Redmon Joseph and Farhadi Ali. 2016. YOLO9000: Better, faster, stronger. CoRR abs/1612.08242 (2016).
[33] Façanha Agebson Rocha, Oliveira Adonias Caetano de, Lima Marcos Vinicius de Andrade, Viana Windson, and Sánchez Jaime. 2016. Audio description of videos for people with visual disabilities. In Universal Access in Human-Computer Interaction: Users and Context Diversity, Antona Margherita and Stephanidis Constantine (Eds.). Springer International Publishing, Cham, 505–515.
[34] Szarkowska A. 2011. Text-to-speech audio description: Towards wider availability of AD. J. Spec. Transl. 15 (2011), 142–162.
[35] Szegedy Christian, Liu Wei, Jia Yangqing, Sermanet Pierre, Reed Scott E., Anguelov Dragomir, Erhan Dumitru, Vanhoucke Vincent, and Rabinovich Andrew. 2014. Going deeper with convolutions. CoRR abs/1409.4842 (2014).
[36] Asociación Española de Normalización (AENOR). 2005. UNE-153020: Audiodescripción para Personas con Discapacidad Visual. Requisitos para la audiodescripción y elaboración de audioguías (Audio Description for Visually Impaired People: Guidelines for Audio Description Procedures and for the Preparation of Audio Guides). Technical Report. AENOR. Available at: www.une.org/encuentra-tu-norma/busca-tu-norma/norma?c=N0032787.
[37] World Health Organization (WHO). 2019. Blindness and Vision Impairment. Retrieved from http://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment.
[38] Xu Yuecong, Yang Jianfei, and Mao Kezhi. 2019. Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature. Neurocomputing 357 (2019), 24–35.
[39] Wang Yue, Wang Xiaojie, and Mao Yuzhao. 2016. First-feed LSTM model for video description. J. China Univ. Posts Telecommun. 23, 3 (2016), 89–93.


Published in

ACM Transactions on Accessible Computing, Volume 16, Issue 2
June 2023
176 pages
ISSN: 1936-7228
EISSN: 1936-7236
DOI: 10.1145/3596450


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 June 2023
      • Online AM: 14 April 2023
      • Accepted: 21 March 2023
      • Revised: 26 January 2023
      • Received: 19 July 2021

