1 Introduction

Studies on education aided by humanoid robots with cognitive functions [1] have attracted interest for years. A humanoid robot has served as a teaching assistant controlled by an instructional design tool in primary education [2]. An important characteristic of a humanoid robot for education is social interactivity with learners. A socially interactive humanoid robot in an educational environment can increase the efficiency of learning [3]. A fuzzy control system for robot communication has proven effective in promoting self-efficacy in language learning [4]. A combination of an educational robot and multimedia learning materials has proven beneficial for increasing student motivation [5]. Furthermore, interactions with humanoid robots can increase human creativity [6].

This study introduces the NAO humanoid robot (Aldebaran Robotics, SoftBank Group) for a principal's speeches and presentations at an elementary school. NAO has programmable gesture and dialog capabilities, which enable cognitive interaction with humans based on recognition functionalities for speech, faces, and objects [7]. Educational researchers have utilized NAO for the instruction and care of children with autism [8–10]. The expressive and affective behaviors of the robot improve communication and reinforce learning [11]. Furthermore, future smart environments and ambient intelligence are expected to produce witty humor [12].

Educational and therapeutic use of humanoid robots typically involves interactive relationships between humans and robots. In contrast, stage speeches or presentations are essentially one-way forms of communication, i.e., the audience is passive. A multi-robot system has performed Manzai, a Japanese-style comedy talk show usually performed by two people, as a passive social medium [13]. Hayashi et al. used a network to facilitate communication between the robots performing Manzai, rather than direct speech recognition between them, because the sensing and recognition systems of the time were inadequate for Japanese-style comedy performance. The humanoid NAO's recognition capability is satisfactory because the words it must recognize are pre-registered or downloaded before the corresponding dialog occurs.

In this study, we examine a pilot system in which the humanoid NAO downloads keywords and related dialogs from an external e-learning server whose data are interconnected semantically and structured on the basis of topic map technology [14, 15]. To engage in dialog with the humanoid robot, a person needs to know which topic words the robot has downloaded. For this purpose, the downloaded list of topic words is shown on a web page generated by the Internet server of NAO's operating system. A human presenter checks the list of candidate words during the dialog using a see-through wearable display connected to an Android device. A stereoscopic 3D display was used to allow simultaneous observation of the humanoid robot and the retrieved information. This paper describes a simple stereoscopic 3D vision method to display the topic texts at an appropriate depth.

2 Method

Topic Map-based e-Learning Server.

The author has created the “Everyday Physics on Web” (EPW) e-learning portal. The EPW system is based on topic map technology [16, 17]. Topic maps (ISO/IEC JTC1/SC34) represent information using “topics,” “associations,” and “occurrences.” Topics represent subjects that are interconnected by various types of associations. An occurrence is a specific type of association that connects a topic to actual information resources, such as text and web pages. The networked structure of topics is referred to as the topic map ontology. Topic maps enable rich and flexible indexing based on semantics and increase the findability of information. Since the knowledge structure can be edited in the topic map tier, a topic maps-based web system is advantageous in terms of extensibility and manageability.
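
To make this data model concrete, the following minimal Python sketch models topics, associations, and occurrences; the class names and sample topics are illustrative only and do not reflect the actual EPW schema.

    from dataclasses import dataclass, field

    @dataclass
    class Occurrence:
        """An information resource (text or URL) attached to a topic."""
        kind: str        # e.g., "dialog text" or "web page"
        resource: str    # text body or URL

    @dataclass
    class Topic:
        name: str
        occurrences: list = field(default_factory=list)

    @dataclass
    class Association:
        """A typed, semantic link between two topics."""
        kind: str
        source: Topic
        target: Topic

    # Illustrative fragment: "rainbow" is associated with "refraction",
    # and each topic carries a dialog occurrence the robot can speak.
    rainbow = Topic("rainbow", [Occurrence("dialog", "A rainbow forms when ...")])
    refraction = Topic("refraction", [Occurrence("dialog", "Refraction bends light ...")])
    link = Association("is explained by", rainbow, refraction)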

The EPW topic map was built using the “Ontopia” [17] topic maps server. Ontopia has its own topic map query language, “tolog,” and its navigator framework, a set of tolog tag libraries, can generate JavaServer Pages. In addition, Ontopia has a web service interface, the Topic Maps Remote Access Protocol (TMRAP), which enables the retrieval of topic map fragments from a remote topic maps server. By enabling TMRAP on EPW, one can utilize the topic map of the EPW server on a client and perform tolog queries from the client to retrieve any element of the EPW server.
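
As a rough illustration, a client-side TMRAP retrieval over HTTP might look like the following Python sketch. The server address, endpoint path, and parameter name here are assumptions for illustration and must be matched to the actual Ontopia installation; the tolog query is schematic and asks for all topics associated with a named topic.

    import urllib.parse
    import urllib.request

    # Hypothetical EPW server address; the real TMRAP endpoint depends on
    # the Ontopia installation.
    EPW_TMRAP = "http://epw.example.org/tmrap/tmrap/get-tolog"

    def get_related_topics(topic_name):
        """Retrieve a topic map fragment with topics associated with topic_name."""
        query = (
            'select $B from '
            'topic-name($A, $N), value($N, "%s"), '
            'role-player($R1, $A), association-role($AS, $R1), '
            'association-role($AS, $R2), role-player($R2, $B), '
            '$A /= $B?' % topic_name
        )
        params = urllib.parse.urlencode({"tolog": query})
        with urllib.request.urlopen(EPW_TMRAP + "?" + params) as response:
            return response.read()  # topic map fragment to be parsed on the client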

See-through Wearable Display.

A see-through binocular wearable display, the EPSON MOVERIO BT-200 (Fig. 1a), was used to monitor the topics of the humanoid robot’s talk. The BT-200 has Wi-Fi and Bluetooth connectivity. In this study, the mirroring capability over Wi-Fi was used to show the PC display on the wearable display. In addition, the binocular dual displays work as a stereoscopic side-by-side 3D viewer. In 2D mode, the display area is 960 × 540 pixels; in 3D mode, two side-by-side 480 × 540 pixel areas are each stretched to the full display area and shown to the left and right eyes, respectively. The distance between the left and right displays is 65 mm. In the 3D display mode, the image appears to be located approximately 4.5 m in front of the user.
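
The relation between on-screen disparity and perceived depth follows from similar triangles: with interocular distance e and a virtual image plane at distance D, a crossed disparity x places the fused image at depth d = eD/(e + x). The short Python sketch below applies this standard geometric model to the BT-200 figures above (e = 65 mm, D ≈ 4.5 m); it is an approximation, not a calibrated display model.

    # Geometric sketch of stereoscopic depth from screen disparity.
    E = 0.065  # interocular distance (m), matching the BT-200 display separation
    D = 4.5    # default virtual image distance of the BT-200 (m)

    def perceived_depth(x):
        """Depth of the fused image for crossed disparity x (m): d = E*D/(E + x)."""
        return E * D / (E + x)

    def required_disparity(d):
        """Disparity needed to place the image at depth d (m): x = E*(D - d)/d."""
        return E * (D - d) / d

    # Example: moving the text from the default 4.5 m to roughly 1.5 m
    # (a robot on stage) requires shifting the images toward each other
    # by about 0.13 m in virtual-image-plane units.
    print(required_disparity(1.5))  # -> 0.13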

Fig. 1. a: BT-200 see-through display with supplemental lenses; b: side-by-side image of the S3D test application. Each of the side-by-side images is transformed from 8:9 to 16:9.

Test Application.

A simple Adobe Flash application was created to test the stereoscopic 3D (S3D) representation of text in this environment using the Papervision3D library. The application shows a “Y-letter,” i.e., three lines joined at one end, in a plane parallel to the image plane (screen), as shown in Fig. 1b. The same “Y-letter” lines were drawn at the centers of the side-by-side windows on a black background. The windows can be moved horizontally by pushing buttons. When this side-by-side display of the “Y-letter” is presented in the 3D mode, the left and right images exhibit a parallax, and the image appears at a distance of approximately 4.5 m from the user by default. The parallax of the image is changed by moving the right or left window. The application runs in the Flash player on a PC, and its display is then mirrored to and controlled by the BT-200.
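
For readers without a Flash environment, the essence of the test application can be sketched in Python with tkinter: two “Y-letter” drawings side by side, with buttons that shift the right-eye image horizontally to change the parallax. This is an illustrative reimplementation, not the original Papervision3D code.

    import tkinter as tk

    W, H = 480, 540  # per-eye window size, matching the BT-200 side-by-side mode

    def draw_y(canvas, cx, cy, arm=80):
        """Draw a "Y-letter": three lines joined at one point (cx, cy)."""
        canvas.create_line(cx, cy, cx, cy + arm, fill="white", width=3)
        canvas.create_line(cx, cy, cx - arm, cy - arm, fill="white", width=3)
        canvas.create_line(cx, cy, cx + arm, cy - arm, fill="white", width=3)

    root = tk.Tk()
    left = tk.Canvas(root, width=W, height=H, bg="black", highlightthickness=0)
    right = tk.Canvas(root, width=W, height=H, bg="black", highlightthickness=0)
    left.grid(row=0, column=0)
    right.grid(row=0, column=1)

    offset = {"right": 0}  # horizontal shift of the right-eye image (pixels)

    def shift(dx):
        # Moving only the right image changes the parallax and hence the
        # apparent depth of the fused "Y-letter" in 3D mode (cf. test 1).
        offset["right"] += dx
        right.delete("all")
        draw_y(right, W // 2 + offset["right"], H // 2)

    draw_y(left, W // 2, H // 2)
    shift(0)
    tk.Button(root, text="nearer", command=lambda: shift(-5)).grid(row=1, column=0)
    tk.Button(root, text="farther", command=lambda: shift(5)).grid(row=1, column=1)
    root.mainloop()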

Test Process.

We must be able to see the retrieved text while looking at and talking to the humanoid robot. Thus, the accommodation point of the user’s eyes is primarily at the position of the robot. To observe the robot and the projected text simultaneously, the S3D text must be moved to the appropriate position (i.e., depth) of the robot. For this purpose, the distance between the side-by-side windows of the application was reduced. When accommodation and convergence agree at a position, the text is seen in clear focus. Some related applications have been published on the MOVERIO application site [18, 19].

To test the effect of convergence on the perception of depth position and possible illusions occurring during the display of the “Y-letter,” two tests were conducted. The participants were asked to hold and focus on a cube as a target object, as shown in Fig. 2 (Tables 1 and 2).

Fig. 2. Holding the cube

Table 1. Test 1: Check stereo image and 3D illusion
Table 2. Test 2: The movement of the side-by-side images

Test 1 checks whether the depth of the line image can be changed by moving only the right-side image. In addition, to consider how the projected image merges with the real scene, it examines the illusion of the lines caused by the pictorial cues of the cube edges.

In test 2, the participants were asked for their impressions of the different ways of converging the side-by-side images. We compared the impressions of the motion of the line images with and without the cube near the lines.

NAO Humanoid Robot.

An application for the NAO humanoid robot was created using its development environment, “Choregraphe.” Note that the “NAOqi” programming framework can call external service APIs. In this study, TMRAP requests were sent as URLs to the EPW server to obtain a candidate list of topic names. When the person speaks one of the listed items and the speech recognition system recognizes it, the dialog occurrence and the topic list associated with that topic are requested from the EPW server. NAO then speaks the dialog occurrence and, using speech recognition, waits for the person to speak. Thus, the person needs to know the candidate list of words retrieved by NAO. When NAO obtains the topic name list, it generates a web page for the list on its internal website. The person requests this page from NAO’s URL, as if looking into the robot’s mind. In this manner, the dialog between the humanoid and the human can be developed dynamically using the knowledge structure of the EPW topic map.
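
A condensed sketch of this loop in Python using the NAOqi API is shown below. The EPW retrieval and publishing helpers are hypothetical wrappers around the TMRAP requests sketched above, and the polling of ALMemory is a simplification; a real Choregraphe application would use event callbacks.

    import time
    from naoqi import ALProxy  # NAOqi Python SDK

    NAO_IP, NAO_PORT = "nao.local", 9559  # placeholder address

    tts = ALProxy("ALTextToSpeech", NAO_IP, NAO_PORT)
    asr = ALProxy("ALSpeechRecognition", NAO_IP, NAO_PORT)
    memory = ALProxy("ALMemory", NAO_IP, NAO_PORT)

    def dialog_turn(topic_name):
        """Speak the dialog occurrence for topic_name, then listen for the next topic."""
        dialog_text = fetch_dialog_occurrence(topic_name)   # hypothetical TMRAP helper
        candidates = fetch_related_topic_names(topic_name)  # hypothetical TMRAP helper
        publish_topic_list(candidates)  # hypothetical: writes the topic-list web page
                                        # on NAO's internal server for the display

        tts.say(dialog_text)

        # Register the downloaded candidate words so that recognition is reliable.
        asr.setVocabulary(candidates, False)
        asr.subscribe("EPW_Dialog")
        try:
            while True:
                data = memory.getData("WordRecognized")  # [word, confidence, ...]
                # 0.4 is a heuristic confidence threshold for this sketch.
                if data and len(data) >= 2 and data[1] > 0.4 and data[0] in candidates:
                    return data[0]  # the next topic chosen by the human
                time.sleep(0.2)
        finally:
            asr.unsubscribe("EPW_Dialog")

    topic = "rainbow"  # illustrative opening topic
    while topic:
        topic = dialog_turn(topic)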

Manzai and Comic Frame.

Japanese comic performance has a common framework: Furi, Boke, Tsukkomi, and Ochi. Furi proposes an interest, topic, or atmosphere for the talk. Boke is speaking funny lines, and Tsukkomi is responding to the funny lines to make them impressive. The flow of Furi, Boke, Tsukkomi, and Ochi is the traditional framework of Japanese comic performances such as Manzai. The speaking system with the humanoid robot in this work has the same structure, as sketched below. First, the human provides the Furi by choosing from the topic list retrieved by the robot. Then, the robot presents the dialog occurrence, which is the main content of the talk. Finally, the human responds to the content presented by the robot. The human can then restart the frame by choosing one of the related topics. If the talk is humorous, it can be considered an effective representation of Manzai.
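
The following minimal Python sketch casts the frame as a turn schedule; the role descriptions are from the text above, and the concrete lines would come from the EPW dialog occurrences and the human presenter.

    def manzai_frame(topics):
        """Yield (speaker, act, topic) turns for a sequence of chosen topics."""
        for topic in topics:
            yield ("human", "Furi", topic)      # propose a topic from the robot's list
            yield ("robot", "Boke", topic)      # speak the dialog occurrence
            yield ("human", "Tsukkomi", topic)  # respond to make the line impressive

    for turn in manzai_frame(["rainbow", "refraction"]):  # illustrative topics
        print(turn)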

3 Results and Discussion

3.1 Topic Maps-Based Dialog

We conducted speech demonstrations, as shown in Fig. 3, based on a combination of topic map retrieval and the built-in dialog box of Choregraphe. The latter provides an exchange of rather humorous lines, whereas the topic map talk conveys knowledge about particular words. The human mediates these knowledgeable parts within an entertaining discussion.

Fig. 3. Principal talking with NAO in an elementary school. The principal wears the BT-200 while talking with NAO.

NAO’s talks with the principal began in autumn 2015. Many elementary school students appear to look forward to NAO’s talk with the principal. From autumn 2015, a second-grade class began to play at a postal system as a school activity. From then until January 2016, the principal received 46 letters from students; 70 % of the letters referred to NAO. In particular, the knowledgeable phrases that NAO spoke often made the students giggle or laugh. Students might feel a sense of incongruity when they see the humorous and expressive NAO speak intellectual words, which the principal admires.

Verbal communication is preferable for human–robot interaction over communication mediated by a computer. However, speech recognition as an interface is far from ideal. Thus, at least for now, humans require a visualization of the robot’s “knowledge” or “brain.” In addition, such a relationship is consistent with the human’s Tsukkomi role. NAO is not remote controlled, e.g., by a PC. In this sense, a see-through wearable display with information retrieval from the robot’s “brain” is preferable to remote control by a PC.

3.2 See-Through Stereoscopic 3D Rendering in Space

Twenty-three subjects aged 19 to 21 years participated in the stereoscopic 3D line rendering experiments. Only two were female. Participants who wore glasses were asked to wear the see-through display over them.

Figure 4 shows the percentage of participants who recognized a change in the position of the lines. Most participants felt that the lines were placed in real space and that their positions shifted back and forth around the real cube held in front of their faces. However, a few participants saw the lines as split or shifted in the opposite direction; this was observed when the lines were shifted around the cube held at a fixed position in space. Note that the shift motion was controlled by the experimenter rather than the participant. A more detailed investigation of the combination of accommodation and convergence during motion is required.

Fig. 4. Stereoscopic cognition of the position shift in test 1.

Figures 5a and b show the percentage of participants who observed the 3D illusion in the projected lines. The “Y-letter” shaped lines were drawn in a vertical plane parallel to the eyes so that there is no parallax within the shape. In addition, since the shape is symmetric, it provides particularly few cues for the emergence of an illusion. Nevertheless, a possible visual illusion is that the junction of the three lines appears to rise toward the observer or, conversely, to cave in.

Fig. 5. Occurrence of the 3D illusion on the lines rendered in the vertical plane. Participants were asked whether the “Y-letter” shaped lines appeared “plane,” “rising at the junction,” or “caving at the junction.” a: lines located around the cube, arbitrarily oriented in space; b: three lines located to overlay the three edges of the cube.

Figure 5a shows the results when the lines were observed around the real cube, which naturally appears in 3D. Most participants felt that the lines were on a plane. However, as the lines moved around the cube, a few participants observed an illusion.

Figure 5b shows the results when the lines were overlaid on three edges of the real cube. The percentage of participants who observed the rising illusion increased significantly. Furthermore, as the lines shifted back and forth, the illusions decreased. In addition, it is intriguing that the rate of the caving illusion increased when the lines were observed on the real cube.

In test 1, the back-and-forth motion of the lines was generated by moving only the right-image window to the left while the left window was fixed. Even such asymmetric manipulation allows the perception of motion in the depth direction, and only a few participants commented on a sense of asymmetric shift. In test 2, the participants were therefore asked whether they could observe the rightward (for a left-window shift) or leftward (for a right-window shift) component of each shift, with and without the cube.

Figure 6 shows the percentage of participants who observed rightward or leftward displacement of the lines. Even when both windows were moved simultaneously, 20–40 % of the participants indicated that they could observe the right and left displacements separately (e.g., a zigzag-like motion) or that the lines appeared split. Note that a few participants reported visual fatigue after this test. Thus, frequent and quick manipulation of the convergence might be visually uncomfortable.

Fig. 6. Percentage of participants who observed horizontal shift motion while shifting the side-by-side images.

4 Conclusion

Human–robot paired dialog was performed using both a dialog list and the dynamic retrieval of information from a topic map server. A binocular see-through wearable display, the EPSON MOVERIO BT-200, was used to monitor the retrieved information.

The stereoscopic 3D display was used to allow simultaneous observation of the robot and the retrieved information. The distance between the side-by-side images was changed to control the convergence and allow the image to be observed at an arbitrary depth in the field of view.