
1 Introduction

According to the World Health Organization [8], there are 285 million people with visual disabilities. Among them, around 246 million have low vision and 39 million are blind. Additionally, 43% of visually impaired people have uncorrected refractive errors (i.e. near-sightedness and far-sightedness), which accounts for 105 million people worldwide.

Among the large list of daily life problems encountered by visually impaired people, three major ones can be distinguished: text comprehension, mobility, and social interaction. Reading text is reported to be the most common problem among people with low vision: around 66% of people with impaired eyesight complain about difficulties with reading [2], which makes it the leading problem. Furthermore, blind people struggle to recognize their surroundings [15]. As a consequence, a large portion of people with poor vision complain about difficulties performing normal in-home activities and about their low mobility (15.1% and 16.3%, respectively) [2]. Additionally, around 10% of visually impaired people report that they cannot recognize faces and therefore have problems with social engagement. Moreover, understanding other people’s emotions is a challenge. These problems make it difficult for visually impaired people to work efficiently, since most jobs require an adequate level of eyesight.

To overcome the mentioned challenges, we propose a portable system which provides information about the surroundings (i.e. image captioning and visual question answering), performs obstacle detection, recognizes people and their emotions, reads text, and implements an automatic size measuring module for objects of interest. To sum up, the contributions of this work are the following:

  • Intelligent system for people with low vision.

  • Distributed system structure allowing near real-time performance.

  • Depth question answering algorithm to measure object sizes automatically.

  • All the relevant code to reproduce the system is made available (see footnote 1).

This paper is organized as follows. In the next section, we provide an extensive literature review of recent research works and existing systems developed for visually impaired people. The proposed methodology and an overview of the implemented modules are introduced in Sect. 3. Finally, a large set of experiments is presented in Sect. 4, followed by a brief conclusion (Sect. 5).

2 Related Works

In this section, we review existing research works and developed systems that aim to improve the life of visually impaired people. We mostly focus on recent works (i.e. from 2012 onwards) which rely on deep learning.

In recent years, mobile solutions for visually impaired people have gained popularity. A good example of such work is smartphone-based obstacle detection [12, 26], which aims to increase the mobility of sightless people. The main emphasis of these works is to detect and classify obstacles in front of the user. In a different work proposed by Wang et al. [17], a wearable system that consists of a depth camera and an embedded computer is used to provide situational awareness for blind or visually impaired people. The system performs obstacle detection and notifies the user about empty chairs and benches nearby. Additionally, Mattoccia et al. [28] have proposed a system built upon 3D glasses for object detection. The authors use a 3D sensor installed on the glasses frame to obtain a depth map of the environment for obstacle detection. Since it is commonly reported that visually impaired people have difficulties with face recognition, Neto et al. [21] have proposed a solution which relies on a Microsoft Kinect as a wearable device. The key point of this work is to help users distinguish people by specific sounds associated with them. In addition, this sound is virtualized in the direction of the corresponding person.

In addition to academic research laboratories, several companies have tried to simplify the life of visually impaired people by introducing relevant systems. For example, Microsoft Cognitive Services [4] has developed a system that includes a headset of Pivothead SMART glasses and a host application that transfers pictures from the glasses to a smartphone. The Cognitive Services API is used to detect faces and facial expressions and to determine the gender and age of the person. This system also performs image captioning (i.e. describing the photo) and text recognition. On the other hand, NVIDIA has proposed a system called Horus [3, 7] that incorporates a headset with cameras and a pocket computer powered by an NVIDIA Tegra K1 for GPU-accelerated computer vision. They use deep learning techniques to recognize faces and objects, read text, and perform image captioning. Another important part of the system is obstacle avoidance using a stereo camera built into Horus. All the commands are given through a controller pad embedded in the pocket computer, while the output is delivered as speech through speakers on the handset. Finally, there exists a smartphone application called Be My Eyes [1] that is designed to connect a visually impaired user with a sighted volunteer through a video connection. Using this application, sighted helpers assist people with visual disabilities in everyday activities. Even though this type of interaction remains helpful, a technological solution can provide far better time performance.

Generally, we observe that recent research works have concentrated on solving particular problems rather than on creating a unified system that includes all the modules crucial for blind people. In this paper, we propose a multipurpose system that incorporates the modules solving the most important problems for blind or visually impaired people.

3 Methodology

In this section, we provide a detailed description of the system and the available modules. First, in Subsect. 3.1, we briefly describe the proposed system from the hardware and software point of view. Subsequently, specific implementation details about every introduced module are covered in Subsects. 3.2–3.8.

Fig. 1. Server-application interaction architecture of the proposed system.

3.1 Detailed Overview of the Proposed System

The proposed system pipeline is composed of three consecutive steps: sending a request from the smartphone application, transferring the data to a server, and receiving the output back in the smartphone application. The pipeline of this process is shown in Fig. 1. Our system can process two different types of images: monocular images acquired from Sony SmartEyeGlasses and stereo images obtained from a ZED Stereo Camera. The image acquisition is followed by a user voice request. These two inputs are collected by the smartglasses’ host application (i.e. a smartphone) and are sent to the server. Finally, the server produces the relevant output, which is converted to text format and transferred back to the smartphone application; the application displays the text on the screen and produces a voice output delivered through the speaker or earphones.

To reduce the computational time, a cloud server carries the main computational load. Our cloud server accepts an image (i.e. a single or stereo image) as well as a user request as text. The user request can be given in the form of a question or a command, which is translated to a string using [30]. This string is then forwarded through a Long Short-Term Memory (LSTM) [27] network that classifies which module is the most appropriate to solve the requested task. After that, the collected image is used as the input for the selected module. Finally, the selected module generates a specific output which is sent back to the user (Fig. 2).
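A minimal sketch of this server-side routing is given below; the function names, the structure of the modules dictionary, and the classifier interface are illustrative assumptions rather than the actual implementation.

```python
# Illustrative sketch of the server-side routing (names and interfaces are
# assumptions). The classifier is the LSTM model of Sect. 3.2; each module
# takes the image and the request text and returns an answer string.

def handle_request(image, request_text, classifier, modules):
    """Route a transcribed user request to the most appropriate module.

    image        -- single or stereo image received from the host application
    request_text -- voice command transcribed to a string
    classifier   -- callable returning {module_name: probability}
    modules      -- dict mapping module name to callable(image, text) -> str
    """
    scores = classifier(request_text)             # softmax scores per module
    module_name = max(scores, key=scores.get)     # pick the most likely module
    answer = modules[module_name](image, request_text)
    return answer                                 # sent back to the smartphone
```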

Fig. 2. Modules architecture on the server.

3.2 User Input Classification

One of the main contributions of this paper is the way we distribute the different tasks to the proper modules. This distribution is crucial in terms of usability since it allows the user to formulate unique requests rather than operating with a set of predefined questions. The goal of our input classification module is therefore to robustly understand which module the user’s request asks to activate. In order to perform this sentence classification, we have utilized an LSTM network. This decision is motivated by the ability of this architecture to preserve long-term dependencies throughout learning. The network consists of two LSTM layers and one fully connected layer followed by a softmax. To avoid overfitting, we have utilized two dropout layers: one between the LSTM layers and one between the LSTM and fully connected layers. The network is fed with sentence representation vectors obtained with fasttext [24]. This architecture achieves a high level of accuracy and robustly classifies user requests. Details about the prepared dataset as well as the training scheme are reported in Subsect. 4.1.

3.3 Depth Question Answering

In order to apprehend the size of surrounding objects, we have developed a fully automatic size measuring system for a smartphone that works in near real time. The basic idea is that, once a stereo image of the scene is obtained, the user can simply request the system to take the desired measurements of the object of interest through voice commands. The stereo image is used to obtain a dense depth map [10]. In the meantime, the object recognition algorithm proposed by Liu et al. [35] detects and recognizes the objects in the left image of the stereo pair. Finally, given the user request as a string, we retrieve the words of interest in this sentence. For this purpose, we extract all the nouns of the input using the Stanford Parser [6] and check their similarity to the labels of the recognized objects in the image by relying on word2vec [33]. The objects whose labels match the retrieved words of interest with a similarity above 50% are kept, while the rest are removed from further consideration. The overview of the proposed module is shown in Fig. 3.
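The matching step can be sketched as follows; for brevity, NLTK’s part-of-speech tagger stands in for the Stanford Parser, and the word2vec model path is a placeholder.

```python
# Sketch of matching user-request nouns to detected object labels.
# NLTK's POS tagger replaces the Stanford Parser used in the paper, and the
# word2vec model path is a placeholder (requires nltk data: punkt,
# averaged_perceptron_tagger).
import nltk
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

def objects_of_interest(request, detected_labels, threshold=0.5):
    """Return detected labels whose similarity to any noun in the request
    exceeds the threshold (50% in the paper)."""
    tokens = nltk.word_tokenize(request)
    nouns = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
    matches = []
    for label in detected_labels:
        for noun in nouns:
            if noun in w2v and label in w2v and w2v.similarity(noun, label) > threshold:
                matches.append(label)
                break
    return matches

# e.g. objects_of_interest("How big is the scooter?", ["scooter", "person"])
```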

Given a potentially noisy depth map of the scene and the location (i.e. bounding box) of the object of interest, the distance to the object can be estimated by averaging the depth values of the pixels that belong to the recognized class within the bounding box. In order to retrieve those pixels, we utilize FCNs [20], which provide a relatively precise segmentation mask with near real-time performance. Then, we select the middle point on each side of the bounding box (left, right, top, bottom) and obtain the 3D positions of these points by assuming that the object lies on a single plane. This approximation remains relatively accurate in most scenarios. Thereafter, the Euclidean distances between these extremal 3D points are used to estimate the height and width of the object.
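A minimal sketch of this computation is given below, assuming a pinhole camera model with known intrinsics (fx, fy, cx, cy) and a metric depth map; these symbols are assumptions for illustration, not values from the paper.

```python
# Sketch of the size-estimation step: assumes a pinhole camera model with
# known intrinsics (fx, fy, cx, cy) and a dense depth map in metres.
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth z to a 3D point in camera coordinates."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def object_size(depth, mask, bbox, fx, fy, cx, cy):
    """Estimate distance, width and height of an object.

    depth -- HxW depth map, mask -- HxW boolean segmentation mask,
    bbox  -- (x_min, y_min, x_max, y_max) bounding box of the object.
    """
    x0, y0, x1, y1 = bbox
    # Distance: average depth over segmented pixels inside the bounding box.
    distance = float(depth[y0:y1, x0:x1][mask[y0:y1, x0:x1]].mean())

    # Mid-points of the four bounding-box sides, assumed to lie on one plane
    # at the estimated distance.
    left   = backproject(x0, (y0 + y1) // 2, distance, fx, fy, cx, cy)
    right  = backproject(x1, (y0 + y1) // 2, distance, fx, fy, cx, cy)
    top    = backproject((x0 + x1) // 2, y0, distance, fx, fy, cx, cy)
    bottom = backproject((x0 + x1) // 2, y1, distance, fx, fy, cx, cy)

    width  = float(np.linalg.norm(right - left))
    height = float(np.linalg.norm(bottom - top))
    return distance, width, height
```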

Fig. 3. Depth question answering algorithm pipeline.

It is worth mentioning that there are several works that aim to measure the size of objects in an unconstrained environment, such as the Google Tango Project [14]. However, in Project Tango, the user is required to manually select points (boundary selection) on the object of interest. Our system removes this constraint and provides a fully automatic measuring device that is appropriate for blind or visually impaired people.

3.4 Obstacle Avoidance

Navigation and obstacle avoidance is one of the main problems for people with visual impairment. Our system focuses on the detection of and timely notification about possible obstacles in front of the user. The algorithm is based on depth map estimation using a stereo camera. First, the pair of images is fed to the algorithm proposed by Geiger et al. [10]. However, due to the algorithm’s simplicity, the obtained disparity map is noisy, which might result in incorrect obstacle detections. Therefore, in order to smooth out noisy responses, we apply a median blur filter to the disparity map. Further, in order to improve the robustness of this module, we select the 100 points closest to the user (from the depth map) and average them to obtain the position of a probable obstacle. If the resulting value is smaller than one meter, we estimate the relative location of the obstacle. Specifically, the module outputs whether there is an obstacle to the left, to the right, or in front of the user. The proposed module works in real time with a processing speed of about 1 fps.
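A simplified sketch of this procedure is given below; the handling of invalid depth values and the left/front/right split by image thirds are our own assumptions.

```python
# Sketch of the obstacle-notification step: the depth map is smoothed with a
# median filter, the 100 closest valid points are averaged, and the obstacle
# position (left / front / right) is derived from their mean column.
import cv2
import numpy as np

def detect_obstacle(depth_map, distance_threshold=1.0, n_points=100):
    """depth_map -- HxW array of distances in metres (from the stereo pair)."""
    smoothed = cv2.medianBlur(depth_map.astype(np.float32), 5)

    # Ignore invalid (zero) depth values when looking for the closest points.
    flat = np.where(smoothed > 0, smoothed, np.inf).flatten()
    closest_idx = np.argsort(flat)[:n_points]
    mean_distance = float(flat[closest_idx].mean())
    if mean_distance >= distance_threshold:
        return None                              # nothing within one metre

    cols = closest_idx % smoothed.shape[1]       # column of each closest point
    mean_col = cols.mean() / smoothed.shape[1]   # normalised to [0, 1]
    if mean_col < 1 / 3:
        return "obstacle on the left"
    if mean_col > 2 / 3:
        return "obstacle on the right"
    return "obstacle in front"
```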

3.5 Emotion Recognition

People without eyesight disabilities perceive other people’s emotions through their facial expressions and changes in intonation. Since it is impossible for visually impaired people to perceive facial expressions, we include an emotion recognition module in our system. For this purpose, we use the relatively shallow convolutional neural network shown in Fig. 4. This network has been designed to avoid overfitting, since the available datasets are relatively small. Our network contains three convolutional layers, two max-pooling layers, and three fully connected layers followed by ReLU.
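A possible Keras realisation of this architecture is sketched below; the filter counts, kernel sizes, and the number of emotion classes are our own assumptions, since only the layer types, the 48 x 48 grayscale input, and the dropout ratio (see Subsect. 4.3) are specified here.

```python
# Possible Keras sketch of the shallow emotion-recognition CNN (Fig. 4).
# Filter counts, kernel sizes and the seven-class output are assumptions.
import tensorflow as tf

def build_emotion_cnn(num_emotions=7):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(48, 48, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),            # dropout ratio from Sect. 4.3
        tf.keras.layers.Dense(num_emotions, activation="softmax"),
    ])

model = build_emotion_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```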

In order to recognize a particular emotion, our method requires cropped faces as input. Therefore, prior to emotion recognition, we detect all the faces in the input image. The performance of the face detection algorithm influences the accuracy of the whole emotion module. To robustly detect faces, we rely on the algorithm proposed by Huaizu and Miller [19], which is in essence a Faster R-CNN [29].

Fig. 4. A shallow CNN used for emotion recognition.

3.6 Face Recognition

Recognizing friends or familiar persons is a very challenging task for visually impaired people. We tackle this problem from the perspective of facial recognition and perform the following procedure to robustly recognize a particular face. The first step is to locate and crop each face in a given image. For this purpose, we utilize the face detection algorithm described in Subsect. 3.5. Subsequently, each detected face is frontalized (aligned) using the algorithm proposed by Hassner et al. [31]. This algorithm uses facial landmarks to align the detected face with a predefined 3D facial model. The rectified face is then fed to FaceNet [16], which produces a 128-dimensional feature vector. There are a number of different ways to compare face vector embeddings. However, under the assumption that each user has a relatively small number of friends (fewer than a few hundred), an L2 norm comparison is fast and efficient enough for a real-time application. Specifically, we compare the extracted face vector with the existing entries in our database.
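A minimal sketch of this comparison step is shown below; the database layout and the acceptance threshold are illustrative assumptions.

```python
# Sketch of the embedding comparison: each known person is stored as a
# 128-dimensional FaceNet vector, and the query face is matched by L2 distance.
# The distance threshold is an illustrative value, not taken from the paper.
import numpy as np

def identify(query_embedding, database, threshold=1.0):
    """database -- dict mapping a person's name to a 128-d numpy vector."""
    best_name, best_dist = None, np.inf
    for name, embedding in database.items():
        dist = np.linalg.norm(query_embedding - embedding)   # L2 norm
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else "unknown person"
```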

3.7 Image Captioning and Visual Question Answering

The image captioning module provides a broad description of the photo, including the objects that are in the image and their interactions. The need for such a module arises from the desire of visually impaired people to get general feedback about their surroundings. For this purpose, we utilize the Show and Tell algorithm [23], which is an encoder-decoder neural network. At the first stage of processing, the network encodes the image into a fixed-length vector representation, which is subsequently decoded into natural language. The encoder network is a state-of-the-art deep convolutional neural network for object detection and recognition; in this work, we have used the Inception v3 model described in [13]. An LSTM network serves as the decoder. In addition, to help visually impaired people orient themselves in their surrounding environment, we have integrated a visual question answering module into our system. The module is built using the state-of-the-art architecture proposed by Fukui et al. [9]. First, the model preprocesses the given question and image using an LSTM and a ResNet512, followed by multimodal compact bilinear (MCB) pooling and visual attention techniques. Then, visual and text features are transformed into a single 16,000-D vector by feeding them to MCB pooling. Finally, treating the last part of the architecture as multi-class classification, the model retrieves the most probable answers.
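For illustration, the MCB pooling step that fuses the two modalities can be sketched in numpy as follows; the input dimensions and random seeds are placeholders, not values from the referenced implementation.

```python
# Illustrative numpy sketch of multimodal compact bilinear (MCB) pooling,
# which combines a visual and a textual feature vector into a single
# 16,000-D vector via count sketches and the FFT.
import numpy as np

def count_sketch(x, d, h, s):
    """Project vector x to d dimensions using hash buckets h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(visual, textual, d=16000, seed=0):
    rng = np.random.RandomState(seed)
    sketches = []
    for v in (visual, textual):
        h = rng.randint(d, size=v.shape[0])        # random hash buckets
        s = rng.choice([-1, 1], size=v.shape[0])   # random signs
        sketches.append(count_sketch(v, d, h, s))
    # Circular convolution of the two sketches == element-wise product in
    # the Fourier domain.
    return np.real(np.fft.ifft(np.fft.fft(sketches[0]) * np.fft.fft(sketches[1])))

# e.g. mcb_pool(np.random.rand(2048), np.random.rand(2048)).shape -> (16000,)
```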

3.8 Text Recognition

The final module present in our system is text recognition. This module implements the Optical Character Recognition (OCR) algorithm described by Barber et al. [34], in which shape features are used to distinguish individual characters. For this purpose, we have utilized the pytesseract [5] OCR implementation. Despite being relatively fast, this algorithm is only applicable to black text written on a light background.
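The module essentially boils down to a single pytesseract call, roughly as follows (the image path is a placeholder, and the Tesseract binary must be installed on the system).

```python
# Minimal usage of the pytesseract OCR wrapper.
from PIL import Image
import pytesseract

def read_text(image_path):
    """Return the recognised text of an image (works best for dark text
    on a light background, as noted above)."""
    return pytesseract.image_to_string(Image.open(image_path))
```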

4 Experiments

In this section, we report module-level experimental results only for user input classification, depth question answering, and emotion recognition. Thereafter, we provide an extensive evaluation of the whole system and its time performance.

4.1 User Input Classification

In order to train the network, we have prepared a dataset composed of 590 questions aimed at robustly differentiating between the existing modules given a user request. The dataset has been prepared manually. Originally, we created 460 questions, which we further extended by performing data augmentation. Specifically, we change the order of words in a sentence as long as the meaning of the sentence remains recognizable. This technique makes our algorithm more robust and allows it to better classify grammatically incorrect sentences. In fact, this kind of data augmentation can improve the performance of any training process that includes an LSTM network.
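A simple sketch of this word-reordering augmentation is shown below; in practice the generated permutations are kept only if they remain understandable, which is checked manually.

```python
# Sketch of the word-reordering augmentation: random permutations of each
# question are generated and then filtered manually for understandability.
import random

def augment(sentence, n_permutations=3, seed=0):
    rng = random.Random(seed)
    words = sentence.split()
    variants = set()
    for _ in range(100):                 # bounded number of attempts
        shuffled = words[:]
        rng.shuffle(shuffled)
        if shuffled != words:
            variants.add(" ".join(shuffled))
        if len(variants) == n_permutations:
            break
    return sorted(variants)

# e.g. augment("what objects are in front of me")
```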

Since the LSTM network requires vector inputs, we have utilized fasttext [24] to obtain a text representation in the form of 300-dimensional vectors. We have limited the input to a maximum of twenty words per sentence; if a sentence contains fewer than 20 words, we pad the remaining entries with zeros. Therefore, the input shape for this network is \(20\times 300\), where rows stand for words and columns represent the features of each word. In order to determine the module that has to be executed for the user request, we have utilized two LSTM layers with an output space of 32 dimensions for the first layer and 16 for the second one, followed by one fully connected layer with a softmax. During training, we have used the Adam optimizer with an initial learning rate of 0.001 and a batch size of 256. The dropout ratio is set to 0.3. The training set constitutes 80% of the dataset, while the validation and testing sets comprise 10% each. The network reaches a testing accuracy of 96.2%.
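A Keras sketch of this classifier, following the hyper-parameters above, could look as follows; the number of module classes is illustrative.

```python
# Sketch of the request classifier: two LSTM layers (32 and 16 units),
# dropout 0.3, Adam with learning rate 0.001, 20x300 fasttext inputs.
import tensorflow as tf

def build_classifier(num_modules=7):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20, 300)),   # 20 words x 300-d vectors
        tf.keras.layers.LSTM(32, return_sequences=True),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.LSTM(16),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_modules, activation="softmax"),
    ])

model = build_classifier()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=256, validation_data=(X_val, y_val))
```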

4.2 Depth Question Answering

In order to evaluate this module, we have selected about 50 images with various objects of interest. We have then applied the algorithm to retrieve the size of the objects. The average absolute measurement error is within 15%.

Representative results are shown in Fig. 5. For these particular images, the predicted and real values are reported in Table 1. While our method provides relatively accurate measurements, there are obvious cases where it fails. First of all, the measurement strongly depends on the tightness of the bounding box around the object of interest. In most cases, the bounding box is larger than the real object, resulting in overestimated values, e.g. for the width of the scooter in Table 1. Secondly, another source of error is the assumption that the object lies on a single plane, which in some scenarios might not hold.

Fig. 5. Visual results of the depth question answering module. The value at the top side of the bounding box represents the distance, the value at the left side the height, and the value at the bottom side the width.

Table 1. Comparison of predicted vs. real measurement values for several objects.

4.3 Emotion Recognition

In order to train this module, we have utilized three distinct datasets: Radboud [22], Cohn-Kanade [25, 32], and FER-2013 [18]. Combining these datasets makes the training data more diverse. Moreover, the Radboud dataset includes five different face rotation angles for each emotion, which helps our algorithm perform better in real-world scenarios. Altogether, these datasets contain 27,100 images. We have then performed image mirroring as a data augmentation method, doubling the number of images (54,200 in total). This combined dataset is divided into training (60%), validation (20%), and testing (20%) sets in such a way that an original image and its mirrored version appear in the same set. During training, we apply the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 2000 images, where each grayscale image has a size of \(48 \times 48\) pixels. The dropout ratio is set to 0.5. The training is run for 120 epochs.

Table 2. The confusion matrix for emotion recognition.

The evaluation results on the testing dataset are presented as a confusion matrix in Table 2. In this matrix, the cell at the ith row and jth column represents the percentage of the ith emotion being recognized as the jth emotion. It can be clearly observed that the best recognized facial expressions are disgust, happiness, and astonishment. These emotions are visually distinct, which makes them easy for the network to recognize. On the other hand, emotions such as sadness and fear may look similar to each other and to anger, making them challenging to recognize. As a result, our network struggles to derive discriminative features that would robustly identify these two emotions. Overall, we have reached an accuracy of 78.4% on the testing set.

4.4 Overall Evaluation of the System

A crucial characteristic of our system is its usability. To serve the user efficiently, the time cost of using the system has to be relatively low. To estimate the average execution time of the modules, we sent 20 requests to each of them and collected the response times. Table 3 shows the average time needed to get a response from each module of the proposed system. Firstly, we have measured the response time of the algorithms themselves, apart from the application; this is, in fact, the execution time of each algorithm. The longest response times are observed for Question Answering and Face Recognition, since the networks used for these modules are computationally expensive. Secondly, we measured the response time of each module when requested from the application, i.e., the time between the moment the user input is uploaded to the server and the moment the response appears on the screen of the smartphone. The gap between the algorithm and application response times differs for each module because it depends on the size of the generated output sentence.

Table 3. Time performance evaluation of each module of the system.
Fig. 6. Sample images used in the survey. The first row shows the original images, while the second row shows the blurred images presented to the respondents.

Furthermore, to test the applicability of the system, we have conducted a survey that includes answers from 25 sighted people. In order to put them in conditions similar to those of visually impaired people, we have shown each volunteer four blurred pictures in which the scene can hardly be recognized on a computer display. The images are selected from a dataset of 20 pictures in such a way that participants have to send a request to each module to understand what is happening in the images. For instance, each set of four images contains a picture with text in it so that participants can evaluate the Text Recognition module. Examples of the utilized images can be seen in Fig. 6. We have allowed the participants to ask three questions per image in order to understand the scene using our system only. For each image, we have given them the context of the situation to help them ask relevant questions. However, all the questions have been formulated by the volunteers themselves and created so that they can obtain as much information as possible from the answers. For instance, for the image with people sitting around a table in Fig. 6, participants frequently asked what the people are doing and how many people are in the image. Thereafter, we have shown them the original images and asked them to evaluate our system on a 10-point scale, considering the answers they received from the system. Our survey (see Table 4) consists of the following criteria: user input classification accuracy (denoted as UIC Accuracy); how informative the content of the answer is (Contents); the person’s satisfaction (Satisfaction); how good the overall system performance is (Performance); and the time efficiency of the system (Time Efficiency). It is important to note that this survey covers all modules except for obstacle detection and depth question answering, since these experiments require walking with the stereo camera; experimenting with blindfolded sighted people in this case produces biased results according to Postma et al. [11].

Table 4. Survey results about the performance of the proposed system. The scores in this table are the average scores of all collected samples.

The obtained results reveal that people are noticeably satisfied with the working speed of our system and with the user input classification accuracy. Overall, respondents feel positive about the whole system, although their level of satisfaction is somewhat lower due to inaccuracies in the system responses. These inaccuracies include answers that lack information or are incorrect. However, the results also indicate that the content of the answers is considered informative; thus, the content is scored slightly higher than the satisfaction. Moreover, our survey demonstrates that the Face Recognition module outperforms the other modules in terms of user satisfaction. In contrast, the Text Recognition module has shown the worst performance in the user study, since the technology we use (see Sect. 3.8) has comparatively low accuracy and stability. It is worth mentioning that the Image Captioning module is the second worst module in terms of contents and satisfaction. This is explained by the fact that this module sometimes omits information that could be useful for visually impaired people. For instance, it might ignore an obstacle in front of the user but describe the actions of people nearby.

5 Conclusion

In this work, we proposed a complete system that targets the major challenges of blind or visually impaired people. The system is composed of smartglasses with an integrated camera and microphone, a stereo camera, a smartphone connected with the smartglasses through a host application, and a server that serves as the computational unit. Our system is capable of detecting obstacles in the immediate surroundings, estimating the size of objects of interest, recognizing faces and facial expressions, reading text, and providing a generic description of a particular input image as well as answering questions about it. We conducted a series of experiments which demonstrated the applicability and usability of our system for blind or visually impaired people.

Although each module of our system is implemented efficiently, the image transfer speed from the Sony SmartEyeGlasses to the host smartphone is relatively slow, making it impractical for real-time application. Therefore, as future work, we want to explore alternative devices as well as improve the user interface for more comfortable usage. Another major improvement would be a background mode, i.e., allowing the user to send a request in advance in order to avoid uncomfortable situations such as asking for a person’s name in front of that person or sending a voice request in a crowded area.