Keywords

1 Introduction

Modern society experiences a time of intensive access to digital media. Entertainment, education, work, in short, almost all daily activities of the modern citizen are influenced by digital media. One of the most successful media is digital video, which is broadcasted on digital TV, shared on social websites (e.g., Vimeo), or distributed on physical media as DVD and Blu-Ray.

Most digital videos, however, are inaccessible to people with visual impairment. According to the World Health Organization, 285 million people with visual disabilities are estimated in the whole world. 40 million of them are totally blind. The majority (90 %) live in developing countries, such as, Brazil and India [5]. These people face daily difficulties. Nowadays, they also have to deal with a socio-cultural exclusion from the digital world, since they do not have guarantees of access to visual content. The majority part of the productions released in digital media (e.g., films, TV series, and popular Youtube videos) does not have accessibility features.

An attempt to reduce this exclusion recommends the addition of a narration of the scenes. This technique is known as audio description, and it adds a second audio to the original one. Audio description can be defined as an activity of linguistic mediation, a mode of inter-semiotic translation that transforms visual to verbal, allowing for a comprehension of scenes when there is no dialog between personages. The idea is to add descriptions of the video visual elements, which are essential for understanding the dialogues and scenes that contain a strong visual character [6]. However, many videos do not have this accessibility feature, due to the economic costs of the audio description production.

Some initiatives exist to provide audio description of films and digital television shows. In many cases, the audio descriptions are made grace to accessibility laws (e.g., Brazil federal law nº 5.296). However, these initiatives contemplate a small portion of digital videos. This problem is even greater if we analyze the digital videos available on the Web (e.g., news websites, video sharing platforms like Youtube and Vimeo).

For a task of this magnitude, a possible approach is to rely on the help of virtual communities, and use systems that assist the process of creating and playing audio description. Sites for collaborative creation of movie subtitles illustrate the potential of the approach. The Opensubtitles.org site is one such example. However, the quality of the audio descriptions should not be compromised in this process.

In recent years, the synthesized voice technologies showed strong growth in performance and effectiveness. They provide the “transformation” of a text into “human” voices. Text-To-Speech (TTS) software is a key component in the functioning of accessibility technologies targeting users with visual impairment, such as: screen readers [2], software to aid Web navigation [2], virtual keyboards [3], and, accessible digital games [1, 4].

In this context, our research goal is to explore the voice synthesized technologies in order to maximize and accelerate the production of audio descriptions. With this approach, we believe it will increase the number of accessible digital videos to the visually impaired.

In this paper, we present the conception, development and initial usability evaluation of a software suite for audio description. We developed a video player– ADVPlayer – that synchronizes the video with a description script (textual file similar to SRT, which contains metadata such as: description time, configuration of the synthesized voice, and scene description). In fact, ADVPlayer executes a second audio generated from a TTS, which is a speech version of a description script. This is a more flexible and low cost approach comparing with the conventional audio description production process (i.e., with human voice recording).

The remainder of this paper is organized as follows: Sect. 2 presents related work on audio description. The ADVPlayer suite, and its design, and development processes are described in Sect. 3. Section 4 presents the usability evaluation implemented with students with visual impairment from two South American countries. Finally, we conclude the paper with final considerations and future work in Sect. 5.

2 Related Work

Research on the development of assistive technologies for digital video is relatively recent. They propose to minimize the task effort of audio description [810]. For instance, researchers of the work [8] recommend the use of a tool to support the audio descriptors’ work: the Subtitle Workshop software. This tool is focused on producing video subtitles. Subtitle Workshop uses the SRT (Subrip subtitle format) for storing the audio description information. The researchers’ goal was to speed up the process of traditional audio description by using their tool. However, the approach is still requiring a human audio description recording.

Researches of Kobayashi et al. focus on removing the audio description recording phase [6]. The authors analyzed the acceptability of speech synthesizers for this task. Analyses were performed with audio in English and Japanese. They conducted a major study of 115 participants with visual impairment during an exhibition of assistive technologies in Japan. The aim was to compare the quality of human audio description with the quality of synthesized audio description. Despite the superiority of human narration, most participants admitted that their quality of life could be enhanced with synthesized narrations. Good digital tools following this approach could increase the number of online video with audio description.

Our research follows some of the Kobayashi’s findings. We propose a set of tools that assist the production, sharing, and execution of synthesized audio description.

3 Synthesized Audio Description

We propose tools to support a Synthesized Audio Description approach. Our research provides audio description with a lower cost because it does not require studio recording of the audio description voices. The approach relies on speech synthesis of what should be narrated. The main objective is to facilitate understanding when no audio dialogues exist between characters, similar to what happens in traditional audio description. The key component of this proposal is a video player that executes a second audio by using a TTS synthesizer. ADVPlayer supports AVI video files (Audio Video Interleave), which originally does not have accessibility features. ADVPlayer synchronizes the original digital video with speech synthesis of a text file. This file contains the audio description and their execution times.

Figure 1 shows the main components of ADVPlayer and its execution process. The system input is composed of two files: a text file with audio descriptions (.adv), which follows the standard used by movie subtitle files (SRT), and the original video file. ADVPlayer has three main modules: (i) a digital media player, to read and reproduce the original video; (ii) a synchronizer to read the descriptions and to invoke the speech synthesizer, which will turn them into voice, and (iii) the TTS software itself.

Fig. 1.
figure 1

ADVPlayer components

According to Zuffo and Pistori [7], the speech synthesis is the generation of sound signals that reproduce the equivalent words in a given natural language. Synthesize voice applications are designed to imitate human speech. In the case of ADVPlayer, we used the eSpeak TTS for this task.

3.1 Audio Description Text File (.ADV)

The description file (.adv) contains the descriptions that provide a good video understanding by people with visual impairment. During a video execution, ADVPlayer reads (.adv) text and passes it as parameter to the eSpeak TTS, following a pre-established synchronism time. The text is converted into synthesized speech, and, then, synchronized with the video.

The ADV format is similar to Subrip Subtitles format (SRT)Footnote 1. The SRT has four parts: (i) the subtitle number; (ii) the time that the subtitle appears on the screen (in hour: minute: second: millisecond format), and the time it will disappear; (iii) the subtitle text; and finally (iv) one blank line, which has the purpose of separating subtitles.

We extended the SRT format in order to support synthesized audio description. We added a new line, which contains two parameters: the synthesized speech rate and the volume of the synthesized voice (line N of Fig. 2). The speech rate parameter indicates the execution speed of the audio description. Its acceleration can avoid overlapping of dialogues. This feature also acts to accelerate longer descriptions in shorter periods of time, without, however, losing the audio description understanding (very useful in audio description of action movies).

Fig. 2.
figure 2

ADV file example

People with visual impairment are habituated to listening to the synthesized voice at fast speeds. The volume parameter allows for overlapping the environment sound of the original video. Thus, the synthesized audio description can be better understood. Figure 2 shows an excerpt from an ADV file containing an audio description example.

W3C proposes TTMLFootnote 2 as standard to represent timed text on the Web. However, we chose to extend SRT, since there is on the Internet a large community of people who produce subtitles of movies, series, and other videos. All these virtual communities are familiar with the SRT format. Our hypothesis is that it will be easier for them to collaborate in the creation of synthesized audio descriptions.

The use of files “.adv” makes possible a constant improvement in the quality of audio descriptions. Changes can be made only by modifying the content of “.adv” files (descriptions, synchronization time, and/or volume). This is quite different from the recorded audio description. Upon completion of the audio description and the video distribution, one cannot update the descriptions without dealing with a substantial financial impact.

3.2 ADVEditor and ADVPlayer

The “.adv” files can be written with any text editor (e.g., Windows Notepad). Nonetheless, we created a tool, named ADVEditor for assisting this editing process. This application allows for inserting textual descriptions and the specification of the speed, volume, and tone of the synthesized voice. The tool includes a video player to facilitate the visualization, testing, and edition of the order of the descriptions inserted.

Figure 3 shows a ADVEditor screenshot. The tool was developed using Java SE. The Standard Widget Toolkit (SWT) was used to implement graphic elements accessible to screen readers. We also used the DirectShow Java (DSJ) API for media reproduction.

Fig. 3.
figure 3

(a) ADV Editor screenshot; (b) ADVPlayer playing Rocky movie; (c) a blind user testing our approach

Descriptions should always be positioned in times when there is no dialogue between the characters, giving a better understanding of the video by the visually impaired. ADVEditor assists in this process by providing the audio description editor test and simulation features.

Once created the “.adv” file, the user can open the ADVplayer program and provides information on the video he wants to run in sync with the audio description. The ADVPlayer tool synchronizes the video execution with reading the descriptions. In addition, it uses the user’s preferences specifications to change the type and volume of the synthesized voice. The ADVPlayer tool was developed using the same ADVEditor technologies.

Figure 3 also shows the ADVPlayer software running. After the ADVPlayer program is opened, the user must: (1) load the description file (.adv); (2) point out the media that contains the video to be played; and (3) select the play button to start playing. The synthesized audio description system will be enabled automatically once the ADVplayer program is loaded.

ADVplayer is a free software and requires for its operation the Windows platform (Windows XP or above). It has its current version in beta testing and is used by a small group of people with visual impairment. The tool can be downloaded at: http://adv-acessibilidade.rhcloud.com.

3.3 Improving the Quality of Audio Description

We have created a web pageFootnote 3 to promote communities of audio description editors. The goal is to ensure the maintenance of quality of descriptions, and increase the available number of films and videos (e.g., translations into other languages).

On the website, people can share new descriptions, propose improvements, and qualify the descriptions already available (through a ranking system). Our website follows the W3C accessibility guidelines.

In this way, we hope that in a short time the number of available titles will grow rapidly through the power of virtual communities. This behavior is already happening in web sites that provide video subtitles (movies, series and other videos).

4 Usability Evaluation

We performed an initial usability testing to determine the technology acceptance and get new ideas and suggestions for improvements of the ADV Player prototype. We use a 3 min video of the film Monsters, Inc. to acclimate users. Then, we used focal group techniquesFootnote 4 to discover, analyze, and validate it by observing users. We perform sessions with eight people with visual disabilities, which had distinct levels of understanding of the topic addressed. This initial evaluation revealed a high comprehension and acceptance in terms of satisfaction and confidence with synthetized audio description.

The ADVEditor prototype was introduced to three audio description specialists using the brainstorming techniques to know how they make the description and their interests in producing descriptions using the application. Diverse suggestions to improve the tools were highlighted. We corrected some aspects of our applications and made its internationalization for Portuguese, English and Spanish.

After taking into account these aspects, we implemented an assessment usability test of ADVPlayer. Tests were made with 19 students with visual disabilities in two South American countries (Brazil and Chile). The tests seek to measure the effectiveness of the ADVPlayer implementation and also to quantify the acceptance of synthesized audio description.

4.1 Sample

The sample was composed by 19 students with visual impairment. The sample was non-probabilistic, selected for convenience. They were students of three specialized institutions on people with visual disabilities: Escuela de Ciegos Santa Lucia, the Col-lege Hellen Keller and the Blind Association of Ceará). The first school is Chilean and others are Brazilian schools.

The sample has:

  • 8 men and 11 women

  • Age from 6 to 47 years old

  • 4 were totally blind e 15 had low vision

12 users had prior knowledge of what was an audio description, although, only seven users have watched a video with this feature. Six of the users reported watching movies with audio only. 4 of them were completely blind and 2 had low vision (with serious vision impairment). Other users have reported watching videos with both image and audio features.

Regarding the use of TTS, 9 of them did not use screen readers, eight use the JAWS screen reader, and two of them prefer the NVDA reader, always with male voices.

4.2 Materials

We decided to use slices of 15 min from the beginning of films to evaluate the ADVPlayer. The goal was to reduce the time of the test sections. We created .ADV files of three films: “The Book Thief”, “The Fault in Our Stars”, and “Inside Out”. All descriptions have been prepared by people with training in scripts for audio description, supervised by a consultant with visual impairment. Figure 4 shows the group of descriptors using the ADVEditor. The complete audio descriptions of the movies are now available at the project Web site.

Fig. 4.
figure 4

Audio descriptors are writing and reviewing (.adv) files on ADVEditor

4.3 Instruments

In all sessions, we administered a questionnaire with four parts. The first one has a pre-collection questionnaire with personal information to draw a profile of the target users studied (results are summarized in Sect. 4.1). The other parts are related to pos-collection, usability feedback, and movie comprehension.

The questionnaire was developed and implemented in order to discover, analyze and validate, from observations of the user group, requirements for the improvement and implementation of ADVPlayer software. The pre-collection allows for knowing each user’s profile. The post-collection measures the impact of using audio description with the synthesized speech.

Usability part of the questionnaire was designed following the works [11, 12]. The questionnaire was created guided by the following metrics:

  • P1 - The audio descriptions are clear and objective, avoided me an erroneous interpretation of the scene. (CONFIDENCE)

  • P2 - The amount of information transmitted during the silence interval is appropriate. (SATISFACTION)

  • P3 - The audio volume of the audio descriptions are appropriate to each situation. (SATISFACTION)

  • P4 - The audio tone is appropriate for each situation. (SATISFACTION)

  • P5 - The type of TTS and the speed rate of the synthesized voice are appropriated to the original movie audio.

  • P6 - The audio description allows imagining the movie scenes. (RELEVANCE)

  • P7 - Watch the movie with synthesized audio description convinced me to watch other movies with this feature. (MOTIVATION)

  • P8 - I feel more pleasured watching the video as a result of the audio descriptions.

  • P9 - I can understand information that was previously only visual. (CONFIDENCE)

To measure the degree of agreement, we used a Likert scale, ranging from 1 (completely disagree) to 5 (completely agree).

The last part of the questionnaire asks about users’ understanding on the movie. It contains questions about the environment of the scenes, the time when the story happened, what and who were the characters, and the identification of the clothes they wore.

4.4 Procedure

The tests were implemented according to the following methodology:

  1. (1)

    Each session has 4–6 students, who were informed about the purpose of the synthesized audio description, and how ADVPlayer works;

  2. (2)

    They chose one of the three films for the group session;

  3. (3)

    Next, the group watched a stretch of about 15 min of the film;

  4. (4)

    At the end, we applied to each of them a questionnaire to get their opinion, usability feedback, and understanding of the film.

The main purpose of this structure was to obtain qualitative results of validation and acceptance. The perceptions reported by participants during the sessions can lead us to a better implementation of the tool in the school environment. Additionally, this structure allows for elucidating new requirements of the proposed software. Figure 5 illustrates one of the sections.

Fig. 5.
figure 5

Two photos illustrating the usability section implemented in Chile

4.5 Results

From the observations and reports obtained from participants in the experiment, we realized that there is a high satisfaction and confidence in the use of audio description technology with voice synthesis. We can highlight as relevant points of the questionnaires results:

  1. 1.

    94.7 % of users understood the audio description for the section film (P1 and P2) and, more specifically, they were able to understand information that was previously only visual (P9).

  2. 2.

    The audio volume (P3) and the type of TTS (P5) were points of disagreement among some users (10.5 % and 31.5 %, respectively). This indicates the need for greater flexibility in movie playback. We are already working to add voice choice and speed in ADVPlayer.

  3. 3.

    The use of the synthesized audio description stimulated the creativity of users (P6) and the search for other videos with audio description (P7); which is a technology acceptance indicator.

  4. 4.

    We realized that the lack of experience of some users with TTS technology had an impact on the assessment of the speed of the voice. 73.68 % of users reported the voice speed was normal, but, 26.3 % of them considered it too fast.

5 Discussion and Conclusion

Studies on synthesized audio-description are still incipient, especially, research on romance languages (e.g., Portuguese, Spanish). However, it is patent this approach is a good alternative to make accessible a lot of visual productions, such as films, series, documentaries, concerts, for people with visual impairment.

The main features of our proposal are: (i) we provide a collaborative creation of the audio-descriptions, since it is based on virtual communities; (ii) the approach allows a constant improvement in the quality of available descriptions; (iii) ADVPlayer uses various types of synthesized voices; and, finally, (iv) we allow users to customize the running speed of the audio description.

Initial usability results of our approach indicate that most users comprehended the information described during the video, they could understand information that before was only visual. The voice speed and the audio volume were aspects of disagreements of users indicating a need for a higher flexibility of these functionalities in the video executions. Also, the use of audio description with synthesized voice stimulated the creativity of diverse users and the search for other videos with audio description evidencing the technology acceptance.

The research is in progress now considering port ADV Player to mobile platforms, such as Android and iOS. As future work, we also plan to make ADVPlayer Web versions. Regarding ADVEditor, we are planning to build a recommendation feature able to detect, through video content processing, silence moments adequate to insert audio descriptions.

Furthermore, a collaborative web site to share description files is being developed. It allows for sharing diverse films, making them fast accessible. The approach opens more options to this population to access to digital media, which contributes to their cultural, social and school inclusion. Finally, we believe that research on synthesized audio description may expand its potential for other populations, such as people with intellectual disabilities, elderly, and dyslexics.