
1 Introduction

Many software applications are dedicated to visually impaired people. Some improve accessibility, such as TalkBack (Android), VoiceOver (iOS), and screen readers in general. Other examples are ZoomReader (iOS) [1], CapturaTalk [2], Google Translator [3], LookTel Money Reader (iOS) [4], TapTapSee (iOS) [5], which identifies objects aloud for blind people, and Alcance-CPqD [6], a Brazilian Portuguese TTS that helps blind people access many smartphone services. See Appendix A for a list of applications. The present paper describes the development of a photo-to-speech instrument for visually impaired people [7] with the following prerequisites: free of charge to users, easy to install and to use, available locally in the country, useful and highly intelligible in Brazilian Portuguese, processed locally without internet or cloud connections, requiring little memory, and compatible with the most commonly used operating systems, such as Windows, iOS, and Android. Specifically, the engine gives access to printed information, turning images of printed texts into audible speech. It allows the user to hear text printed on a sheet of paper or written/posted on a wall, on billboards, and on outdoor signs. The application follows this sequence: capture of an image of the A4 printed text by the CCD camera, preprocessing of the image, optical character recognition with an OCR (Optical Character Recognition) engine [8], extraction of the text [12], and generation of synthesized speech with a TTS (text-to-speech) application.

2 Used Technologies

2.1 Image Acquisition

The specifications of the digital camera are critical: mobile devices need a main camera of 6–8 megapixels to capture images of an A4-size sheet. The quality and resolution of the image are crucial for the preprocessing stage and for character recognition; the OCR requires roughly 10 × 10 pixels per character to work efficiently. The preprocessing stage improves the overall performance of text recognition by feeding a cleaner image to the OCR framework. Finally, the TTS framework provides the audio stream to the internal digital signal processing (DSP) unit that drives the speakers. Text classification is a binary decision in which an input text image is considered readable or non-readable, without any character recognition. The OCR is described in the next subsection. Figure 1 shows a flowchart of the processing steps: taking a picture of the document, preprocessing (binarization and small skew corrections), submitting the image to the OCR, calling the voice synthesizer, and delivering the audio output to the user.
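The flow of Fig. 1 can be condensed into a short script. The sketch below is illustrative only: it assumes the Python bindings opencv-python, pytesseract, and pyttsx3 and an installed Portuguese language pack ("por") for Tesseract, whereas the application itself is built with EmguCV/.NET, Tesseract, and a voice synthesizer, as described in the following subsections.

    # Illustrative sketch of the Fig. 1 flow: capture -> OCR -> speech.
    # Assumes the Python bindings opencv-python, pytesseract and pyttsx3;
    # the real application uses EmguCV/.NET, Tesseract and a TTS engine.
    import cv2
    import pytesseract
    import pyttsx3

    def photo_to_speech(camera_index=0, lang="por"):
        cap = cv2.VideoCapture(camera_index)      # document camera
        ok, frame = cap.read()                    # take the picture
        cap.release()
        if not ok:
            raise RuntimeError("camera capture failed")
        # Preprocessing (gray scale, binarization, rotation, filtering) is
        # detailed in Sect. 2.2; here the raw frame goes straight to the OCR.
        text = pytesseract.image_to_string(frame, lang=lang)
        engine = pyttsx3.init()                   # system voice synthesizer
        engine.say(text if text.strip() else "No text found")
        engine.runAndWait()

    if __name__ == "__main__":
        photo_to_speech()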

Fig. 1. Flowchart of software operation.

2.2 Optical Character Recognition Tesseract

The Tesseract Optical Character Recognition (OCR) engine performs part of the task: it extracts text from image data. Tesseract began as a PhD research project at HP Labs, Bristol; today it is an open-source OCR engine distributed by Google Inc. This technology has been applied to photographed documents whose content can be played back and listened to in audio format using TTS, resembling the spoken form of the same text as read by a native speaker. Well established in computer science, early approaches to language research focused on automating the analysis of the linguistic structure of language. This research also relied on the development of basic technologies such as fast digital processors, multi-megapixel CCD cameras, OCR, machine translation, and speech synthesis, among others. Nowadays, such tools are employed in real-world applications, such as spoken dialogue systems and speech-to-speech translation engines. Google Translate (GT) is one such practical application: a multilingual engine that translates text, speech, or real-time text images from one language into another. It offers a web interface and cloud interfaces for Android and iOS mobiles. It has one limitation, however: the photo-to-speech functionality of GT is not autonomous, but depends on cloud processing. To improve system performance [9], a preprocessing stage is applied to the image before it is submitted to character recognition. Preprocessing improves the overall performance of text recognition by feeding a cleaner input to the OCR framework. It includes the following sequence of operations: color to gray-scale transformation (8 bits/pixel), image binarization (1 bit/pixel), image rotation, and median filtering. These image operations use the OpenCV function library.
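The four preprocessing operations listed above map directly onto OpenCV calls. The sketch below is a minimal illustration using the cv2 Python binding (the application calls the same OpenCV functions through EmguCV); the 90-degree rotation and the 3 x 3 median kernel are assumed, illustrative choices.

    import cv2

    def preprocess_for_ocr(image_bgr):
        # 1. Color to gray-scale transformation (8 bits/pixel).
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        # 2. Binarization (conceptually 1 bit/pixel); Otsu chooses the threshold.
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # 3. Rotation of the landscape capture to portrait (90 degrees, assumed).
        portrait = cv2.rotate(binary, cv2.ROTATE_90_CLOCKWISE)
        # 4. Median filtering to remove salt-and-pepper noise (3 x 3 kernel, assumed).
        return cv2.medianBlur(portrait, 3)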

OpenCV [10], written in optimized C/C++ with multi-core processing and aimed at real-time applications, is very useful for our software. EmguCV [11], a cross-platform .NET wrapper, was used to call OpenCV functions and to capture the image of the printed text to be recognized. The Ziggi HD camera employed, with 3264 × 2448 pixels, has enough resolution to recognize the characters. The captured images need to be rotated to portrait orientation. It was observed that the OCR needs a resolution of approximately 300 dpi (dots per inch), or about 10 × 10 pixels per character, to give good recognition results.
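As a rough check of this requirement, the effective resolution over an A4 sheet (8.27 in × 11.69 in) can be estimated from the pixel dimensions alone; the small helper below is a hypothetical illustration of that arithmetic, not part of the application.

    A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69   # A4 sheet in inches

    def effective_dpi(width_px, height_px):
        # Approximate dpi when the whole frame covers one A4 sheet in portrait.
        return min(width_px / A4_WIDTH_IN, height_px / A4_HEIGHT_IN)

    # Ziggi HD frame rotated to portrait: 2448 x 3264 pixels -> about 279 dpi,
    # close to the 300 dpi target (about 10 x 10 pixels per character).
    print(round(effective_dpi(2448, 3264)))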

Some image operations are performed inside Tesseract itself. The implementation and optimization of this autonomous photo-to-speech instrument aims, in a short period of time, at the design of fully accessible equipment. Reading text available in digital format for the visually impaired requires its conversion to the Braille system or, more recently, to a digital speech synthesizer. Nowadays, most published printed works include neither audio versions nor Braille editions. Thus, the development of an autonomous and portable machine that captures images containing text and converts them into speech [13] is greatly useful for visually impaired people [14].

Figure 2 shows, for an A4 sheet, the relation between resolution, width, height, and image size. In our case, a resolution of about 300 dpi is necessary for better recognition.

Fig. 2. Comparison of A4 paper size and image resolution.

2.3 Voice Synthesizer

The software application can use the embedded voice synthesizer [15] of the operating system, or it can use a free screen reader or synthesizer such as NonVisual Desktop Access (NVDA), eSpeak, or the Alcance voice synthesizer, a Brazilian project that includes a voice synthesizer.
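When the operating system's embedded synthesizer is used, the call reduces to a thin wrapper around its speech API. The sketch below is a minimal illustration with the pyttsx3 binding (which drives SAPI5 on Windows, NSSpeechSynthesizer on macOS, and eSpeak on Linux); whether a Brazilian Portuguese voice is found depends on the voices installed on the system.

    import pyttsx3

    def speak(text, rate_delta=0):
        engine = pyttsx3.init()                   # SAPI5 / NSSpeech / eSpeak
        engine.setProperty("rate", engine.getProperty("rate") + rate_delta)
        # Prefer a Brazilian Portuguese voice if one is installed.
        for voice in engine.getProperty("voices"):
            if "pt" in (voice.id or "").lower():
                engine.setProperty("voice", voice.id)
                break
        engine.say(text)
        engine.runAndWait()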

3 Operation and Features

Aiming to be very easy and friendly to use, the app needs only a few commands, as listed below. The accessibility of the desktop version relies entirely on the keyboard, as follows (a minimal dispatch sketch is given after the list):

  • (1) Return/Enter: if the program is in idle mode, triggers a frame capture and the whole cycle shown in Fig. 1.

  • (2) Esc: if the program is in idle mode, closes the application; if a speech is playing, it is canceled.

  • (3) Left/Right arrow keys: decrease/increase the speech rate.

  • (4) Space: pauses/resumes the speech.

  • (5) Bar: language choice.
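The command set above reduces to a small keyboard dispatch. The sketch below is hypothetical: the key names, the handlers, and the language-choice key are stand-ins for the real implementation.

    # Hypothetical dispatch for the desktop command set; the handlers are stubs
    # standing in for the real capture, OCR and TTS calls.
    def capture_and_read():  print("capture + OCR + speech (Fig. 1 cycle)")
    def stop_speech():       print("speech canceled")
    def change_rate(delta):  print("speech rate changed by", delta)
    def toggle_pause():      print("pause/resume speech")
    def choose_language():   print("next OCR/TTS language")

    LANGUAGE_KEY = "LANG"    # placeholder for the language-choice key in item (5)

    def handle_key(key, speaking=False):
        if key == "RETURN" and not speaking:
            capture_and_read()                             # item (1)
        elif key == "ESC":
            if speaking:
                stop_speech()                              # item (2): cancel speech
            else:
                raise SystemExit                           # item (2): close the app
        elif key in ("LEFT", "RIGHT"):
            change_rate(10 if key == "RIGHT" else -10)     # item (3)
        elif key == "SPACE":
            toggle_pause()                                 # item (4)
        elif key == LANGUAGE_KEY:
            choose_language()                              # item (5)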

4 Architecture

Figure 3 shows the setup with a stand to hold the camera, so that better pictures of the texts can be obtained; in close-up photos it is necessary to avoid camera shake. Figure 4 shows the state machine model of the desktop software [17]. When using the mobile version to take close-up photos of text, it is better to use a stand, as shown below. Figure 5 shows a special stand with four legs, designed for mobiles: it keeps the right distance to photograph an A4 text, and its legs guide the position of the paper.

Fig. 3. Desktop configuration, with a camera and adjustable stand.

Fig. 4. Architecture: state machine model.

Fig. 5. FourPod: stand for mobile capture (patent pending).

For example, the accessibility of the previous mobile version operates under Samsung's native voice assistant: one touch on the screen triggers a frame capture and the whole cycle. The “LETEX” app applies small corrections to misaligned text columns, handling angular positions of up to 30°; the text may even be positioned upside-down.
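Upside-down pages, in particular, can be detected with Tesseract's orientation and script detection (OSD) before recognition. The sketch below shows one common way to do this with the pytesseract and OpenCV Python bindings; it illustrates the idea only and is not the LETEX code.

    import re
    import cv2
    import pytesseract

    def fix_orientation(image):
        # OSD reports the rotation (0, 90, 180 or 270 degrees) needed to upright the page.
        osd = pytesseract.image_to_osd(image)
        rotate = int(re.search(r"Rotate: (\d+)", osd).group(1))
        turns = {90: cv2.ROTATE_90_CLOCKWISE,
                 180: cv2.ROTATE_180,
                 270: cv2.ROTATE_90_COUNTERCLOCKWISE}
        return cv2.rotate(image, turns[rotate]) if rotate in turns else image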

5 Blind Users Review and Performance

5.1 Review

According to a reviewer, a digital reader like this one has a positive impact on users' lives, providing independence and privacy when used with headphones, for example: independence because the user can digitize a text or a document alone, without any help. The program is easy to open and run, and there are few buttons to push. The alignment of the paper is also guided by a frame, and the user can choose the speed and pitch of the voice synthesizer. Compared to equivalent commercial equipment, this one is affordable, since it uses a personal computer and/or mobile devices. On mobile devices it is most useful for reading written characters on signs and billboards, although it is not so easy to capture sharp photos with a mobile because of hand vibrations; it is therefore better to use a stand, as shown, to obtain good text pictures.

5.2 Performance

The pros and cons of performance pointed out by users, and also observed in the preliminary tests, are summarized in the following subsections. Figure 6 shows a two-column text that was inverted to a negative image and could still be recognized by the Tesseract OCR.

Fig. 6. Two-column text, which can also be recognized by Tesseract OCR.

Figure 7 shows test results, with very few errors and a short processing time, indicating that the project can succeed and that the software is useful for blind people, following the premises of simplicity, few commands, affordability, and modest equipment requirements.

Fig. 7. Performance data of the reader.

5.3 Pros

  • The software gives direct access to a document right after the picture is shot.

  • With a guide platform it is very easy to align a document to be photographed.

  • It is very simple and intuitive to use and quite cheap to acquire.

  • The desktop version is very useful for document reading, and the mobile version is better for reading billboards, indoor building directions, and street signs.

  • The user can choose the speech rate and the voice pitch.

5.4 Cons

  • It would be practical if users could save the file (not implemented yet).

  • It would be useful if the program warned the user about the reasons why certain documents or texts cannot be read (not implemented yet).

  • Blind users have difficulty framing an A4 printed sheet with a smartphone.

  • Vibrations of the smartphone cause poor images and difficulties for the OCR.

  • The mobile version is better suited to reading outdoor signs than to reading documents.

6 Conclusion

The proof of concept has shown the feasibility of the project [18]. It has practical importance mainly in giving visually impaired people access to printed matter in general. Independence from an internet connection means no money spent on smartphone data plans. The software and application are friendly and very easy to use, with few commands. The desktop version is better for reading printed texts in A4 format, and the mobile version is useful for street signs and billboards.

7 Future Work

The results of the preliminary work on this project have shown the possibility of adding other functionalities, such as saving the read text in any desired format, including audio formats, and, without increasing complexity, integrating the Brazilian voice synthesizer Alcance-CPqD as the default choice and adding spoken warnings such as “there is no text!” and “please realign the paper!”.
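A minimal sketch of the save functionality, assuming pyttsx3 for the audio export (file names and formats are illustrative; the feature is not implemented yet):

    import pyttsx3

    def save_reading(text, basename="reading"):
        # Save the recognized text and an audio rendering of it (illustrative only).
        with open(basename + ".txt", "w", encoding="utf-8") as f:
            f.write(text)
        engine = pyttsx3.init()
        engine.save_to_file(text, basename + ".wav")
        engine.runAndWait()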

8 Final Comments

At the end of 2011, the Brazilian federal government created a plan for people with disabilities. The plan, entitled “Living without Limits” [16], has four main areas: access to education, health care, social inclusion, and accessibility. It involves the cooperation of 15 federal agencies, states, and municipalities. This project has been supported by one of these Brazilian federal agencies, FINEP, under contract number 01.13.0038.00, coordinated by the Fundação de Apoio à Capacitação em Tecnologia da Informação (Facti). As shown in the caption of Fig. 5, the FourPod stand for mobile capture has a patent pending [19].