Research Article

Machine Generation of Audio Description for Blind and Visually Impaired People

Published: 24 June 2023

Abstract

Automating the generation of audio description (AD) for blind and visually impaired (BVI) people is a difficult task that involves several challenges: identifying gaps in the dialogue; describing the essential visual elements; summarizing the descriptions and fitting them into the dialogue gaps; generating an AD narration track; and synchronizing it with the main soundtrack. In our previous work (Campos et al. [6]), we proposed a solution for automatic AD script generation, named CineAD, which uses the movie’s script as the basis for AD generation. This article extends that solution by classifying visual information from the video, both to complement the information extracted from the script and to reduce the dependency on it. To assess the viability of the proposed solution, we implemented a proof of concept and evaluated it with 11 blind users. The results showed that the solution generates a more succinct and objective AD while preserving a level of user understanding similar to our previous work. Thus, the solution can provide relevant information to blind users while using less video time for descriptions.
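
As an illustration of the first challenge listed above, the sketch below shows one way to locate dialogue gaps long enough to hold an AD narration, assuming the dialogue timing is available as an SRT subtitle file. The SRT input, the helper names, and the 2-second minimum gap are assumptions made for this example only; the paper does not publish CineAD’s implementation.

    # Minimal sketch (not CineAD's actual code): find silences between
    # consecutive subtitle cues that are long enough to hold an AD narration.
    import re
    from datetime import timedelta

    # An SRT timing line looks like: 00:01:02,500 --> 00:01:05,000
    CUE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")

    def parse_time(ts: str) -> timedelta:
        """Convert an SRT timestamp (HH:MM:SS,mmm) into a timedelta."""
        h, m, s = ts[:8].split(":")
        ms = ts[9:]
        return timedelta(hours=int(h), minutes=int(m),
                         seconds=int(s), milliseconds=int(ms))

    def dialogue_gaps(srt_text: str, min_gap_seconds: float = 2.0):
        """Return (start, end) pairs of pauses between consecutive cues
        lasting at least min_gap_seconds (an assumed threshold)."""
        cues = CUE.findall(srt_text)
        gaps = []
        for (_, prev_end), (next_start, _) in zip(cues, cues[1:]):
            start, end = parse_time(prev_end), parse_time(next_start)
            if (end - start).total_seconds() >= min_gap_seconds:
                gaps.append((start, end))
        return gaps

In the pipeline the abstract describes, each description extracted from the script (or from the video classifiers) would then be summarized to fit the duration of the chosen gap before being synthesized and mixed with the main soundtrack.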


REFERENCES

[1] American Council of the Blind (ACB). 2019. The Audio Description Project. Retrieved from https://www.acb.org/adp/ad.html.
[2] Bolaños Marc, Peris Álvaro, Casacuberta Francisco, Soler Sergi, and Radeva Petia. 2018. Egocentric video description based on temporally-linked sequences. J. Vis. Commun. Image Repres. 50 (2018), 205–216.
[3] Adelson J., Flaxman S., Briant P., Bottone M., Vos T., Naidoo K., Braithwaite T., Cicinelli M., Jonas J., and Bourne R. R. 2020. Global prevalence of blindness and distance and near vision impairment in 2020: Progress towards the Vision 2020 targets and what the future holds. Investig. Ophthalm. Vis. Sci. 61 (2020).
[4] Braun Sabine, Starr Kim, and Laaksonen Jorma. 2020. Comparing Human and Automated Approaches to Visual Storytelling. 159–196.
[5] Calvo-Salamanca S., Coca-Castro A. F., and Velandia-Vega J. A. 2016. Web prototype for creating descriptions and playing videos with audio description using a speech synthesizer. In Proceedings of the 8th Euro American Conference on Telematics and Information Systems (EATIS’16). 1–7.
[6] Campos Virginia P., Araújo Tiago M. U. de, Filho Guido L. de Souza, and Gonçalves Luiz M. G. 2020. CineAD: A system for automated audio description script generation for the visually impaired. Univ. Access Inf. Societ. 19 (2020), 99–111.
[7] Chapdelaine Claude and Gagnon Langis. 2009. Accessible videodescription on-demand. In Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’09). ACM, New York, NY, 221–222.
[8] Chen Tseng-Hung, Zeng Kuo-Hao, Hsu Wan-Ting, and Sun Min. 2017. Video captioning via sentence augmentation and spatio-temporal attention. In Computer Vision—ACCV 2016 Workshops, Chen Chu-Song, Lu Jiwen, and Ma Kai-Kuang (Eds.). Springer International Publishing, Cham, 269–286.
[9] Deng J., Dong W., Socher R., Li L.-J., Li K., and Fei-Fei L. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition.
[10] Domingues Leonardo A., Campos Virgínia P., Araújo Tiago M. U., and Filho Guido L. de S. 2016. Accessibility in digital cinema: A proposal for generation and distribution of audio description. In Proceedings of the 22nd Brazilian Symposium on Multimedia and the Web (Webmedia’16). ACM, New York, NY, 119–126.
[11] Drossos Konstantinos, Adavanne Sharath, and Virtanen Tuomas. 2017. Automated audio captioning with recurrent neural networks. CoRR abs/1706.10006 (2017).
[12] Encelle Benoît, Beldame Magali Ollagnier, and Prié Yannick. 2013. Towards the usage of pauses in audio-described videos. In Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility (W4A’13). ACM, New York, NY.
[13] Fernández-Torné Anna. 2016. Audio description and technologies: Study on the semi-automatisation of the translation and voicing of audio descriptions. Ph.D. Dissertation. Universitat Autònoma de Barcelona, Spain.
[14] Firdus S., Ahmad W. F. W., and Janier J. B. 2012. Development of audio video describer using narration to visualize movie film for blind and visually impaired children. In Proceedings of the International Conference on Computer and Information Science (ICCIS’12). 1068–1072.
[15] Gagnon L., Foucher S., Heritier M., et al. 2009. Towards computer-vision software tools to increase production and accessibility of video description for people with vision loss. Univ. Access Inf. Soc. 8 (2009), 199–218.
[16] Hurtado C. J., Rodríguez A., and Seibel C. 2010. Un Corpus de Cine. Fundamentos Teóricos de la Audiodescripción (A Corpus of Cinema: Theoretical Foundations of Audio Description). Universidad de Granada, Proyecto Tracce, 13–56.
[17] Ichiki Manon, Shimizu Toshihiro, Imai Atsushi, Takagi Tohru, Iwabuchi Mamoru, Kurihara Kiyoshi, Miyazaki Taro, Kumano Tadashi, Kaneko Hiroyuki, Sato Shoei, Seiyama Nobumasa, Yamanouchi Yuko, and Sumiyoshi Hideki. 2018. Study on automated audio descriptions overlapping live television commentary. In Computers Helping People with Special Needs, Miesenberger Klaus and Kouroupetroglou Georgios (Eds.). Springer International Publishing, Cham, 220–224.
[18] Karkar Abdel Ghani, Puthren Mary, and Al-ma’adeed Somaya. 2018. A bilingual scene-to-speech mobile based application. In Proceedings of the International Conference on Computer and Applications (ICCA’18), Beirut, 1–240.
[19] Kobayashi Masatomo, Nagano Tohru, Fukuda Kentarou, and Takagi Hironobu. 2010. Describing online videos with text-to-speech narration. In Proceedings of the International Cross-Disciplinary Conference on Web Accessibility (W4A’10). ACM, New York, NY.
[20] Kobayashi Masatomo, O’Connell Trisha, Gould Bryan, Takagi Hironobu, and Asakawa Chieko. 2010. Are synthesized video descriptions acceptable? In Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS’10). ACM, New York, NY, 163–170.
[21] Lakritz J. and Salway A. 2002. The Semi-automatic Generation of Audio Description from Screenplays. Technical Report CS-06-05. Dept. of Computing, University of Surrey.
[22] Lin T., Maire M., Belongie S., et al. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision.
[23] Liu An-An, Xu Ning, Wong Yongkang, Li Junnan, Su Yu-Ting, and Kankanhalli Mohan. 2017. Hierarchical & multimodal video captioning. Comput. Vis. Image Underst. 163, C (Oct. 2017), 113–125.
[24] Lopez Mariana, Kearney Gavin, and Hofstadter Krisztian. 2018. Audio description in the UK: What works, what doesn’t, and understanding the need for personalising access. Brit. J. Vis. Impair. 36 (Aug. 2018).
[25] Masse Mark. 2011. REST API Design Rulebook. O’Reilly Media, Sebastopol.
[26] Mazur Iwona. 2020. Audio description: Concepts, theories and research approaches. In The Palgrave Handbook of Audiovisual Translation and Media Accessibility, Ł. Bogucki and M. Deckert (Eds.). Palgrave Studies in Translating and Interpreting. Palgrave Macmillan, Cham.
[27] Mazur Iwona. 2020. A functional approach to audio description. J. Audiovis. Transl. 3, 2 (Dec. 2020), 226–245.
[28] Nguyen Khoa, Drossos Konstantinos, and Virtanen Tuomas. 2020. Temporal sub-sampling of audio feature sequences for automated audio captioning. arXiv preprint arXiv:2007.02676.
[29] Nunes E. V., Machado F. O., and Vanzin T. 2011. Audiodescrição como Tecnologia Assistiva para o Acesso ao Conhecimento por Pessoas Cegas (Audio Description as Assistive Technology for Access to Knowledge for the Blind). Pandion, Florianópolis, 191–232.
[30] Oliveira Rita, Abreu Jorge Ferraz de, Almeida Margarida, and Cardoso Bernardo. 2016. Inclusive approaches for audiovisual translation production in interactive television (iTV). In Proceedings of the 7th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion (DSAI’16). ACM, New York, NY, 146–153.
[31] Perera M., Farook C., and Madurapperuma A. P. 2017. Automatic video descriptor for human action recognition. In Proceedings of the National Information Technology Conference (NITC’17). 61–67.
[32] Redmon Joseph and Farhadi Ali. 2016. YOLO9000: Better, faster, stronger. CoRR abs/1612.08242 (2016).
[33] Façanha Agebson Rocha, Oliveira Adonias Caetano de, Lima Marcos Vinicius de Andrade, Viana Windson, and Sánchez Jaime. 2016. Audio description of videos for people with visual disabilities. In Universal Access in Human-Computer Interaction: Users and Context Diversity, Antona Margherita and Stephanidis Constantine (Eds.). Springer International Publishing, Cham, 505–515.
[34] Szarkowska A. 2011. Text-to-speech audio description: Towards wider availability of AD. J. Spec. Transl. 15 (2011), 142–162.
[35] Szegedy Christian, Liu Wei, Jia Yangqing, Sermanet Pierre, Reed Scott E., Anguelov Dragomir, Erhan Dumitru, Vanhoucke Vincent, and Rabinovich Andrew. 2014. Going deeper with convolutions. CoRR abs/1409.4842 (2014).
[36] Asociación Española de Normalización (AENOR). 2005. UNE-153020: Audiodescripción para Personas con Discapacidad Visual. Requisitos para la audiodescripción y elaboración de audioguías (Audio Description for Visually Impaired People: Guidelines for Audio Description Procedures and for the Preparation of Audio Guides). Technical Report. AENOR. Available at: www.une.org/encuentra-tu-norma/busca-tu-norma/norma?c=N0032787.
[37] World Health Organization (WHO). 2019. Blindness and Vision Impairment. Retrieved from http://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment.
[38] Xu Yuecong, Yang Jianfei, and Mao Kezhi. 2019. Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature. Neurocomputing 357 (2019), 24–35.
[39] Wang Yue, Wang Xiaojie, and Mao Yuzhao. 2016. First-feed LSTM model for video description. J. China Univ. Posts Telecommun. 23, 3 (2016), 89–93.


Published in

ACM Transactions on Accessible Computing, Volume 16, Issue 2
June 2023
176 pages
ISSN: 1936-7228
EISSN: 1936-7236
DOI: 10.1145/3596450


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 June 2023
      • Online AM: 14 April 2023
      • Accepted: 21 March 2023
      • Revised: 26 January 2023
      • Received: 19 July 2021

