Abstract
Live communication over long distances is indispensable for seniors, and tools have evolved to improve the sense of presence: video adds sight, and robots represent bodies at remote locations. Emerging technologies that assist cognitive abilities may improve communication quality even beyond face-to-face reality. However, these technologies have been developed independently and cannot be integrated easily, which not only raises development costs but also hampers the introduction of new technology in this area. We therefore propose a platform for remote live communication that provides portable, fast data-transfer connections as fundamental functions and offers a plug-in framework through which features can be extended dynamically on the basis of a common interface. In this paper, we explain the design of this platform and describe plug-in based applications and scenarios built on it as examples validating the concept.
Keywords
- Senior workforce
- Remote communication
- Tele-presence robot
- Agent conversations
- Physical devices
- Augmented reality
- Cognitive assistance
- Accessibility
- Plug-ins
- WebRTC
1 Introduction
Quality of communication is important for putting collaborative work into practice. Face-to-face conversation enables much faster comprehension than delayed communication with static documents, which is why we consider meeting in person when we are frustrated with written exchanges. The richness of information in live communication lets us read emotions and situations more efficiently, which results in a higher quality of understanding. However, this need to communicate in person has kept people who face barriers to travelling, including the elderly, out of important collaborative situations. They have therefore wanted tools for live communication that overcome distance so that they can participate in collaborations in a practical way.
Technologies and products fulfilling this demand have been coming closer to imitating face-to-face conversation, evolving in the richness of information they convey to deepen understanding and the sense of presence. Video conferencing systems allow us to view remote sites, and tele-presence robots represent our bodies at those sites. Recently, vision technologies and sensory devices have also been evolving and assisting the comprehension of situations.
However, these technologies have been developed independently, without common protocols or re-usable interfaces. This has inhibited developers from integrating one product with another or from porting an effective implementation to other applications whose features would benefit from it. For example, to apply the voice-command operation set of one application to another, we need to look for an API that exposes the voice stream of the participants and then see whether the implementation can be ported to it. Furthermore, this scenario assumes that the application allows access to those resources at all, but some offer no programmable interface [2].
Therefore, we propose a platform for remote live communication in which features can be extended flexibly. Features are developed as plug-in components so that they can be assembled and re-used with one another, while the platform guarantees portable, fast data-transfer connections among remote sites.
In this paper, after reviewing related work, we describe the design of the platform and applications built on it as a proof of concept. We also discuss possible integrations and then present our conclusion.
2 Related Work
2.1 Technologies that Help Live Communication
In addition to auditory conversation over phones, video conferencing has been prevalent since Skype [1] was released. Skype also established a platform of applications using the Internet and liberated people from physical phones. Such live communication using audio and video has become ubiquitous, and expansion in both medium and expression is now beginning. Various types of tele-presence robots provide different experiences on top of video conferencing. Double [3] and Beam [4] can be driven remotely, while Kubi [5] pans and tilts its face on a table. The anonymized face of OriHime [6] encourages patients who cannot leave the hospital or home to communicate with others outside through it [7]. Technologies categorized as “tele-existence” cover expression through the five senses: TELESAR V [12] shares the tactile information of an avatar, and Kissenger [13] is a pair of mobile-phone devices that enables users to kiss each other remotely, helping them maintain intimate relationships.
Aside from remote communication, technologies that analyze or sense real-time data are evolving. Speech recognition is used to caption video for better comprehension by people who have difficulty hearing [14]. Vision technology that recognizes facial features [11] is used to make visually impaired people aware of someone approaching them. There are even prototypes that extract emotion from facial expressions [10] or replace/deform the user's image into a preferred one [15, 16] in real time. The Digital Audio Controller lets a listener raise or mute the volume of multiple audio sources to pick out and focus on what they should, or want to, hear [17].
2.2 Platforms that Help Build Applications for Communication
OpenTok [18] and SkyWay [19] are used to build live communication services such as video chat applications and the remote consoles of tele-presence robots [3, 5]. OpenTok initially used a proprietary protocol on Adobe Flash, but both frameworks now use WebRTC [23] as the transmission protocol layer. These platforms are used for applications beyond video conferencing, but they still lack a framework in which features can be extended or re-used dynamically.
Each robot requires specific software to design its behavior. The Robot Engine [20] is a platform for designing the behavior of any type of humanoid robot. f3.js [21] is an attempt to encapsulate embedded devices (microcomputer, sensor, and actuator) as programmable components so that a developer can integrate them using a single programming language (JavaScript). These platforms were made for local devices, but they will also be useful for extending the capability of people working from remote places.
3 Platform
Here we propose a platform for live communication on which features can be developed and added extensibly. We prototyped this platform and developed components for live communication. Applications consisting of those components were used during a remote ICT lecture course among senior citizens [22]. In this section, we describe the key components and the design of the platform.
3.1 WebRTC
WebRTC [23] is a W3C Editor's Draft (as of January 2016) standard for peer-to-peer (P2P) connection between browsers. In addition to video and audio channels, it provides a data channel over which an application can transfer arbitrary data.
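As an illustration, the following minimal sketch opens a WebRTC peer connection with a data channel in the browser using the standard API; the signaling exchange (offer/answer and ICE candidates) is omitted, and the STUN server address is a placeholder.

```javascript
// Minimal sketch: a P2P connection with a data channel alongside audio/video.
// Signaling is omitted; any side channel (e.g., a WebSocket) can carry it.
const pc = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.example.org' }]  // placeholder STUN server
});

// The data channel carries arbitrary application data between the peers.
const channel = pc.createDataChannel('plugin-data');
channel.onopen = () => channel.send(JSON.stringify({ type: 'hello' }));
channel.onmessage = (event) => console.log('received:', event.data);

// Attach local camera and microphone tracks to the same connection.
navigator.mediaDevices.getUserMedia({ video: true, audio: true })
  .then((stream) => stream.getTracks().forEach((t) => pc.addTrack(t, stream)));
```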
To achieve a sense of reality when communicating with remote sites, the speed of data transmission is critical, and a fast connection preferably has as few intermediate nodes on its route as possible. This was one reason we chose a P2P transmission standard. Because the ICT lecture course was held in an all-purpose meeting room at a public community center, we had to set up the equipment within a limited time before every single class. We therefore wanted to reduce the steps and machines needed for preparation so that class could start quickly, and WebRTC's freedom from proprietary tools reduced the number of prerequisites and machines to be installed or launched.
3.2 Implementing Features as Pluggable Extensions
Figure 1 shows how the components representing individual features are arranged in the platform. At its base, the platform provides a component that manages connections using WebRTC. The connections initiated by this manager expose interfaces through which the other components can write and read data. Because each connection is established peer to peer, data is transmitted with low latency, which benefits components handling real-time data. The components are loosely coupled through an event publish-subscribe pattern, so the features of an individual application can be extended or woven in dynamically.
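The publish-subscribe coupling could look like the following sketch; the `Hub` class and topic names are illustrative, not the platform's actual API. Later sketches in this paper re-use this hub object.

```javascript
// Illustrative event hub: components exchange events through topics instead of
// calling each other directly, so plug-ins can be added or removed at run time.
class Hub {
  constructor() { this.topics = new Map(); }
  subscribe(topic, handler) {
    if (!this.topics.has(topic)) this.topics.set(topic, new Set());
    this.topics.get(topic).add(handler);
    return () => this.topics.get(topic).delete(handler);  // unsubscribe function
  }
  publish(topic, data) {
    (this.topics.get(topic) || []).forEach((handler) => handler(data));
  }
}

const hub = new Hub();
hub.subscribe('video.frame', (frame) => { /* e.g., head tracking consumes frames */ });
hub.publish('video.frame', { width: 640, height: 480, pixels: null });
```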
The components that can be arranged within this platform vary: not only a bridge to a camera device as an audio and video source, but also bridges to physical devices that realize a body remotely, interaction assistants that help with manipulating those devices, and informational assistants such as speech recognition that provides captions or minutes of what is said in a meeting.
3.3 Bridge for the Physical Devices
The web browser, one of the prerequisites of this platform, is almost ubiquitous, but we sometimes need to overcome its limitations, such as cross-domain restrictions and limited access to outside resources. Serial communication is one such limitation, yet it is the most popular way to communicate with physical devices such as actuators and sensors. To drive robots from components in the platform, we implemented a server running on the same computer as the browser, using Node.js [24], as a communication proxy to physical devices. This server communicates with the platform's components over a WebSocket and transfers command and response data to and from the serial port. We also made a basic component encapsulating this exchange so that it can be re-used by other components that need to control physical devices.
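A minimal sketch of such a proxy follows, assuming the `ws` and `serialport` npm packages (our actual implementation may differ); the serial port path and baud rate are placeholders.

```javascript
// Local proxy: browser components <-(WebSocket)-> this server <-(serial)-> robot.
const { WebSocketServer } = require('ws');
const { SerialPort } = require('serialport');

const robot = new SerialPort({ path: '/dev/ttyUSB0', baudRate: 9600 });
const wss = new WebSocketServer({ port: 8090 });

wss.on('connection', (ws) => {
  // Browser component -> robot: forward commands to the serial port.
  ws.on('message', (command) => robot.write(command));
  // Robot -> browser component: forward response data back over the socket.
  robot.on('data', (data) => ws.send(data));
});
```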
4 Applications
Here we describe applications built on this platform as example implementations, along with some possible combination scenarios.
4.1 Cogi: A Tele-presence Robot with “Observation Window”
Cogi is a tele-presence robot controlled by the movement of the operator's head, on the basis of the “observation window” motif [25]. Cogi was built on this platform and used by the main lecturers during the experimental remote ICT lecture courses among senior citizens [22]. As basic functions, the platform has two components: “vc.pero”, the connection manager, and “vc.video”, which produces video and audio streams from a camera device. From these we made an embeddable video-conference component and used it to build an application that gives lecturers an integrated view from multiple camera sources at the remote site. In addition, we added components to build the console application for Cogi: “vc.htrack”, which implements the “observation window” behavior, and “vc.cogi”, which handles serial communication with the robot. “vc.htrack” subscribes to the image stream from “vc.video” and locates the position of the controller's head. It then calculates the rotation angle of the controller's head and transmits it through “vc.cogi” so that the robot follows the operator's head. Meanwhile, it transmits the same angle to the remote site through “vc.pero”, where the remote “vc.htrack” forwards it to “vc.cogi” so that the remote robot faces the place the operator intended (Fig. 2).
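The flow of “vc.htrack” could be sketched as follows; the face locator, the field-of-view constant, and the hub and data-channel objects (re-used from the earlier sketches) are illustrative assumptions, not the component's actual code.

```javascript
// Illustrative "vc.htrack" flow: camera frame -> head yaw -> local robot and remote peer.
const CAMERA_FOV_DEG = 60;  // assumed horizontal field of view of the camera

hub.subscribe('video.frame', (frame) => {
  const face = locateFace(frame);  // hypothetical face-locating function
  if (!face) return;
  // Approximate yaw from the horizontal offset of the head within the frame.
  const yaw = (face.centerX / frame.width - 0.5) * CAMERA_FOV_DEG;
  hub.publish('cogi.rotate', { yaw });  // "vc.cogi" makes the local robot follow
  channel.send(JSON.stringify({ type: 'htrack', yaw }));  // to the remote "vc.pero"
});
```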
This “observation window” mechanism helped senior lecturers control the robot at the remote site without heavy cognitive load, even while they were conducting the class. Additionally, “vc.htrack”, the implementation of this mechanism, can be applied to other robots of similar capability, such as Kubi [5] or OriHime [6], just by adding a bridge implementation for the individual device as a replacement for “vc.cogi”.
Cogi also has features for adding captions to the conversation and taking minutes. For these, we prepared three additional components: “vc.sptotxt”, which extracts text from the spoken voice using a Speech-to-Text (STT) service; “vc.caption”, which shows the text recognized by “vc.sptotxt” on the screen; and “vc.minutes”, which saves the history of the recognized text and has a UI for showing and editing it. Both “vc.caption” and “vc.minutes” subscribe to a single instance of “vc.sptotxt” to reduce the footprint. Some browsers do not allow voice data to be extracted from a remote audio stream, so we added a fix to “vc.sptotxt” that, simply by re-using “vc.pero”, sends a command to call the STT service at the remote site and passes the result back. This demonstrates the advantage of plug-in based development, in which features are componentized with portability.
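The sharing of one STT instance could be sketched like this; the browser's `webkitSpeechRecognition` (available in some browsers) is a stand-in for the STT service, and the caption and minutes consumers are simplified.

```javascript
// One shared STT instance ("vc.sptotxt") publishing to multiple subscribers.
const recognition = new webkitSpeechRecognition();  // stand-in STT engine
recognition.continuous = true;
recognition.onresult = (event) => {
  const text = event.results[event.results.length - 1][0].transcript;
  hub.publish('stt.text', { text, time: Date.now() });
};
recognition.start();

// "vc.caption" and "vc.minutes" both subscribe to the same instance.
const minutes = [];
hub.subscribe('stt.text', ({ text }) => showCaption(text));  // hypothetical renderer
hub.subscribe('stt.text', ({ text, time }) => minutes.push({ time, text }));
```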
4.2 Virtual Agent Mocoro Follows and Facilitates Communication
Mocoro (Fig. 3) is a virtual avatar that listens to people and facilitates communication through its affordances. It moves, with its Kawaii [26] appearance, in accordance with the context of the conversation. Mocoro has two panels at the bottom of the screen that show assistive information such as the aim of the meeting, pictures, and so on. Mocoro is built on this platform and re-uses Cogi's STT extension (“vc.sptotxt”) to understand what people are discussing. In addition, it has a morphological analyzer using Kuromoji [27] to extract keywords and a Text-to-Speech (TTS) extension to reply to people vocally. With these components, it can hold a simple conversation: it answers vocally when asked its name, shows a pre-registered photo on the information panel when requested by voice, and displays the keywords recognized in the talk as a tag cloud on the information panel.
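Keyword extraction could be sketched as follows, assuming the kuromoji.js JavaScript port of Kuromoji; the dictionary path and hub topics are placeholders.

```javascript
// Extract noun keywords from recognized speech with a morphological analyzer.
const kuromoji = require('kuromoji');

kuromoji.builder({ dicPath: 'node_modules/kuromoji/dict' }).build((err, tokenizer) => {
  if (err) throw err;
  hub.subscribe('stt.text', ({ text }) => {
    const keywords = tokenizer.tokenize(text)
      .filter((token) => token.pos === '名詞')  // keep nouns only
      .map((token) => token.surface_form);
    hub.publish('keywords', keywords);          // e.g., feeds the tag cloud
  });
});
```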
Elderly people often have high-level knowledge or skills in a particular area, but they might not have experience in teaching. Teachers who are not used to conducting classes may fail to support the students who are struggling to catch up with the content. This can result in poorer comprehension of the class, and Mocoro is designed to aid speakers in this respect.
Nonverbal information is important for understanding people [28]. However, it is not conveyed well enough through a display. People who are not used to talking in front of an audience are likely to speak too fast to be understood because they feel pressured and nervous. Therefore, the speed of speech must be appropriately controlled. In addition, the teacher should not stutter much, because students will become irritated and less engaged in the class. However, it is difficult for novice teachers to realize this by themselves. Mocoro is placed beside the main screen that the teacher faces during the class, and it reflects the confusion or irritation of the students on the other side. It gradually changes the color of its face from pink to blue and hangs its head little by little if the teacher speaks too fast or stutters a lot. Mocoro does not speak to the teacher aloud, to avoid interrupting the class; instead, it shows the estimated feelings of the students quietly and moderately. When the teacher notices that Mocoro is looking down with a very pale face and asks it “Are you all right?”, it looks up and replies “Yes, I am okay, but you speak too fast for me.” In this way, Mocoro reminds teachers to speak at a more moderate pace for the students, and consequently we expect they will come to correct themselves naturally.
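The pace estimation behind this behavior could be sketched as follows; the window size, the comfortable-rate threshold, and the Mocoro rendering calls are illustrative assumptions.

```javascript
// Estimate speaking pace from recognized text and map it to Mocoro's posture.
const WINDOW_MS = 30000;     // sliding 30-second window
const COMFORTABLE_RATE = 8;  // assumed comfortable pace, characters per second
let samples = [];

hub.subscribe('stt.text', ({ text, time }) => {
  samples.push({ chars: text.length, time });
  samples = samples.filter((s) => time - s.time < WINDOW_MS);
  const charsPerSec = samples.reduce((sum, s) => sum + s.chars, 0) / (WINDOW_MS / 1000);
  // 0 = relaxed, 1 = maximally stressed; drives face color (pink -> blue) and head pitch.
  const stress = Math.min(1, Math.max(0, charsPerSec / COMFORTABLE_RATE - 1));
  mocoro.setFaceColor(stress);      // hypothetical rendering API
  mocoro.setHeadPitch(-30 * stress);
});
```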
Mocoro can also be used in general meetings. When a meeting proceeds quickly, some attendees may be unable to follow the content. However, they sometimes hesitate to show that they missed something and pretend to understand, even, or especially, when the meeting is important and everyone needs a common understanding. Mocoro works for such cases: it listens to the participants and estimates the speed of topic transitions by counting the keywords extracted by the STT and morphological-analyzer components. If too many topics are discussed in a short time, it explicitly warns that the above problem might be occurring by moving its head and changing the color of its face.
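Counting topic transitions could be sketched like this; the window length and threshold are illustrative.

```javascript
// Warn when too many distinct keywords appear within one minute.
const TOPIC_WINDOW_MS = 60000;
const TOPIC_LIMIT = 15;  // assumed threshold for "too many topics"
let seen = [];

hub.subscribe('keywords', (keywords) => {
  const now = Date.now();
  keywords.forEach((k) => seen.push({ k, t: now }));
  seen = seen.filter((e) => now - e.t < TOPIC_WINDOW_MS);
  const distinct = new Set(seen.map((e) => e.k)).size;
  if (distinct > TOPIC_LIMIT) {
    hub.publish('mocoro.warn', { reason: 'topics-too-fast', count: distinct });
  }
});
```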
These affordances embodied in Mocoro's actions are built from the components of this platform and can be ported to other scenarios, regardless of whether the participants are joining remotely.
4.3 Magnifying the Sound of the Remote Site Using “Observation Window”
Acoustic zooming [17], named “Zoomap” here, was optimized for the experimental remote ICT lecture courses among senior citizens [22]. It assists lecturers' comprehension of what the students at the remote site are saying by magnifying the volume of some audio sources and diminishing the others. Each student's voice is captured by a headset microphone. Lecturers use tablets on which the volume level of each source is visualized by location, and they select the area whose audio sources should be magnified (Fig. 4).
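The per-source level metering could be sketched with the Web Audio API as follows; acquiring each microphone's stream and mapping levels to seat positions on the tablet are omitted.

```javascript
// Measure the level of one microphone stream so it can be drawn at the
// student's seat position on the lecturer's tablet.
const ctx = new AudioContext();

function makeLevelMeter(stream, onLevel) {
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(stream).connect(analyser);
  const buf = new Uint8Array(analyser.fftSize);
  setInterval(() => {
    analyser.getByteTimeDomainData(buf);
    // Root-mean-square amplitude around the 128 midpoint of the byte samples.
    const rms = Math.sqrt(buf.reduce((sum, v) => sum + (v - 128) ** 2, 0) / buf.length);
    onLevel(rms / 128);  // normalized 0..1 level
  }, 100);
}
```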
When Zoomap was introduced, one lecturer told the students that he was now able to “see” the sound in the remote classroom and could catch what the students were saying even when they spoke in subdued voices. As this shows, Zoomap was positively accepted by the lecturers and effectively helped them understand what was going on at the remote site during class. However, we also observed that some lecturers were unable to use it, especially when they were busy conducting the class. We concluded that this was because operating it on the tablet imposed additional cognitive load.
We therefore suggest introducing Cogi's observation-window motif to select the audio source to be magnified. For example, when the lecturer looks in from the right-hand side, the voice of a student on the left side of the remote site is magnified, and vice versa (Fig. 5). In this way, lecturers do not need to use their hands to listen to the place they are looking at. Zoomap currently uses a direct connection to transfer the volume levels of the audio sources and the commands to magnify them; this can be replaced with the P2P data channel of this platform by using “vc.pero”.
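The suggested coupling could be sketched as follows, re-using the audio context from the previous sketch; the seat angles, gain range, and source-list shape are illustrative assumptions.

```javascript
// Map the head yaw published by "vc.htrack" to per-source gains.
// sources: [{ stream, angle }] -- assumed shape; angle is the seat direction
// as seen from the camera, in degrees.
const gains = sources.map((s) => {
  const gain = ctx.createGain();
  ctx.createMediaStreamSource(s.stream).connect(gain).connect(ctx.destination);
  return gain;
});

hub.subscribe('htrack.yaw', ({ yaw }) => {
  sources.forEach((s, i) => {
    // Looking in from the right (positive yaw) magnifies students on the left.
    const match = 1 - Math.min(1, Math.abs(s.angle + yaw) / 45);
    gains[i].gain.value = 0.2 + 1.8 * match;  // diminish others, magnify target
  });
});
```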
4.4 Morphing the Representation or Expression of the Communication
Vision technologies are evolving, and some studies show the possibility of extracting facial features [10] and morphing a face in real time [16]. If we integrate these vision technologies into live remote communication, we might even be able to augment communication beyond what is possible face to face. In fact, one reason OriHime [7] is popular with patients who cannot leave their hospital beds is that its anonymized face covers their real appearance in bed. Also, when we suggested a simplified configuration of a remote lecture system [22] to senior lecturers so that they could conduct classes from home, some female lecturers were concerned that they would need to put on careful make-up even at home. If this kind of morphing can be woven into the video stream, it may remove such hesitations about communicating.
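Weaving a morphing pass into the outgoing video could be sketched as follows; `morph()` stands in for a face-deformation library such as clmtrackr [16], and the frame rate is a placeholder.

```javascript
// Draw camera frames to a canvas, transform them, and re-capture the canvas
// as the stream that is sent over the connection instead of the raw camera.
const video = document.createElement('video');
video.srcObject = cameraStream;  // the original getUserMedia stream
video.play();

const canvas = document.createElement('canvas');
const g2d = canvas.getContext('2d');

function drawFrame() {
  g2d.drawImage(video, 0, 0, canvas.width, canvas.height);
  morph(g2d, canvas);            // hypothetical face-morphing pass
  requestAnimationFrame(drawFrame);
}
drawFrame();

const morphedStream = canvas.captureStream(30);  // replaces the camera stream
```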
5 Possible Future Considerations
With various devices mixed in one place, we need to consider conflicts between the fundamental nature of a device and the application scenario. For example, there will be conflicts in taking turns with a robot when multiple participants take part in a meeting session, because most tele-presence robots are designed for a single user. A conventional resolution would use UIs such as an on-screen “raise a hand” button, but in practical use an interaction in which the robot automatically chooses who takes control would be desirable.
A common framework will also be needed for adapting features to the setup of the remote peer. Cogi basically uses robots on both peers (one tracks the face of the controller, and the other, at the remote peer, moves in accordance with the controller's motion), but it offers a simplified setting mode when there is no robot at the controller's site. We transmitted the information about how the remote peer is set up in the initial handshake phase, but a standard or common implementation describing the attributes of a remote environment, like the profiles of Bluetooth, would be preferable.
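The handshake information could take the form of a small capability profile such as the following; every field here is an illustrative assumption, analogous in spirit to Bluetooth profiles.

```javascript
// Capability profile sent over the data channel during the initial handshake.
const localProfile = {
  role: 'controller',
  robot: { present: true, type: 'cogi', axes: ['pan', 'tilt'] },
  audio: { sources: 8, zoomable: true },
  stt: { local: true }  // whether this peer can run speech recognition itself
};
channel.send(JSON.stringify({ type: 'profile', profile: localProfile }));

// On receipt, each peer adapts its components: for example, Cogi falls back
// to the simplified mode when the remote profile has robot.present === false.
```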
6 Conclusion
We proposed a portable and extensible platform for live remote communication and described our prototype as an example of how it can be utilized.
We expect this platform to advance the realization of ICT that helps people communicate and work together over distances, not only by saving development cost but also by opening a channel for introducing new technology into this area. Consequently, we will be able to communicate with others regardless of distance, with a certain sense of reality or even beyond it through augmented cognitive abilities, which will help people who face barriers to travelling to work practically. Optimizing the working resources of a community is important in rapidly aging societies, and the concept of this platform will help uncover hidden talent and resources.
References
Skype. http://www.skype.com/
FaceTime. http://www.apple.com/ios/facetime/
Double. http://www.doublerobotics.com/
OriHime. http://orihime.orylab.com/
How the Japanese robot avatar OriHime fights loneliness. https://www.techinasia.com/orihime-robot-fights-loneliness
Kinect. http://www.xbox.com/en-US/xbox-one/accessories/kinect-for-xbox-one
Jeni, L.A., Girard, J.M., Cohn, J.F., Kanade, T.: Real-time dense 3D face alignment from 2D video with automatic facial action unit coding. In: Automatic Face and Gesture Recognition (FG). IEEE (2015)
Fernando, C.L., Furukawa, M., Kurogi, T., Hirota, K., Kamuro, S., Sato, K., Minamizawa, K., Tachi, S.: TELESAR V: TELExistence surrogate anthropomorphic robot. In: SIGGRAPH 2012, Article 23. ACM (2012)
Samani, H.A., Parsani, R., Rodriguez, L.T., Saadatian, E., Dissanayake, K.H., Cheok, A.D.: Kissenger: design of a kiss transmission device. In: Proceedings of DIS 2012, pp. 48–57. ACM (2012)
Automatic captions in YouTube. https://googleblog.blogspot.jp/2009/11/automatic-captions-in-youtube.html
Dale, K., Sunkavalli, K., Johnson, K.M., Vlasic, D., Matusik, W., Pfister, H.: Video face replacement. In: Proceedings of SIGGRAPH Asia 2011, Article 130, 10 pp. ACM (2011)
Realtime face deformation. https://auduno.github.io/clmtrackr/examples/facedeform.html
Izumi, M., Kikuno, T., Tokuda, Y., Hiyama, A., Miura, T., Hirose, M.: Practical use of a remote movable avatar robot with an immersive interface for seniors. In: Stephanidis, C., Antona, M. (eds.) UAHCI 2014, Part III. LNCS, vol. 8515, pp. 648–659. Springer, Heidelberg (2014)
OpenTok. https://tokbox.com/
Bartneck, C., Soucy, M., Fleuret, K., Sandoval, E.B.: The robot engine - making the Unity 3D game engine work for HRI. In: Proceedings of RO-MAN 2015, pp. 431–437. IEEE (2015)
Kato, J., Goto, M.: Form follows function(): an IDE to create laser-cut interfaces and microcontroller programs from single code base. In: Proceedings of UIST 2015 Adjunct, pp. 43–44. ACM (2015)
Takagi, H., Kosugi, A., Ishihara, T., Fukuda, K.: Remote IT education for senior citizens. In: Proceedings of W4A 2014, Article 41. ACM (2014)
WebRTC. https://www.w3.org/TR/webrtc/
Node.js. https://nodejs.org/en/
Kosugi, A., Kobayashi, M., Fukuda, K.: Hands-free collaboration using telepresence robots for all ages. In: Proceedings of CSCW 2016. ACM (2016)
Cheok, A.D.: Kawaii/cute interactive media. In: Cheok, A.D. (ed.) Art and Technology of Entertainment Computing and Communication, pp. 223–254. Springer, New York (2010)
Kuromoji. http://www.atilika.org/
Burgoon, J.K., Guerrero, L.K., Floyd, K.: Nonverbal Communication. Allyn & Bacon, Boston (2010)
Acknowledgements
This research was partially supported by the Japan Science and Technology Agency (JST) under the Strategic Promotion of Innovative Research and Development Program. We thank all the participants of the experiments.