
1 Introduction

Virtual Reality (VR) is an established technology, particularly in gaming applications. Now that the corresponding content and technologies are becoming accessible to a vast number of users, people are starting to consider VR as a new media form with a plethora of new possibilities for entertainment, industry, art, communication, tourism and much more. While VR primarily seems to be the perfect medium for games, filmmakers are beginning to explore it as a new medium for telling stories. This includes both computer-generated VR experiences and filmed 360-degree films. However, it has turned out that planning and shooting a film in 360 degrees does not work the way it does in conventional movies. The viewer takes the place of the camera and can choose freely where to look. This brings a high risk that he might miss parts of the action because he focuses on something not relevant to the plot. Likewise, if the pacing is wrong, he can become either bored or completely overwhelmed by a scene. Many established techniques of filmmaking, such as editing or dolly shots, can even be distracting when applied to VR film. The established filmic language no longer seems to work when applied to VR and 360-degree films.

VR filmmakers are already eagerly experimenting with solutions involving rather artistic approaches, but there is as yet little scientific research on these specific issues.

In the scope of this study we subsume filmed 360-degree films under the term VR film, although there are technical differences. We consider cinematic narration in VR to be the set of methods and techniques that enable the filmmaker to guide the viewer through the scene and the narrative. Hence, this study takes a comprehensive approach to the topic, putting the various aspects into context with each other. Initially, we determined six major challenges for cinematic narration in VR:

  1. Guiding the viewers’ attention to the relevant story elements

  2. Choosing the role of the viewer between an active participant and a passive observer

  3. Choosing the right place for the camera, the action and story elements, and what consequences this has for seated viewers

  4. Balancing spatial and temporal story density

  5. Rethinking Framing

  6. Rethinking Editing

We approached these challenges by examining methods already in use, comparing approaches in other media forms, conducting an empirical test with 50 participants and an online survey, and by making connections to other fields of research. Section 2 briefly summarizes the related work on VR film and current challenges for VR filmmakers. In Sect. 3 we introduce guidelines for cinematic narration in VR that make it easier to deal with the challenges mentioned above. In Sect. 4 we present our user study, which was conducted to verify the assertions of the guidelines, and its results.

2 Theoretical Background and Related Work

Research on the specific topic of cinematic narration in VR was rare until 2017. Since then, several studies have been published. The issue of guiding the viewers’ attention has received particular attention: the effects and effectiveness of diegetic and non-diegetic attentional cues in VR have been compared and evaluated [2, 3]. Brillhart introduced Probabilistic Experiential Editing (PEE) as an editing concept for VR film [4]. Reyes [5] discussed a “Screenwriting Framework for an Interactive VR Film”, which applies the dramatic structure of the hero’s journey to interactive VR and 360-degree films.

To get a better understanding of VR as a medium and how film can be implemented in it, it is useful to look at research in many other areas such as film, games, theatre, narratology and human perception. Therefore, the preceding study [1] made a broad examination of the related topics: VR’s high potential for creating a feeling of presence has been explored, based on the Two-Level Model of Spatial Presence introduced by Wirth et al. [6]. This presence, in which the viewer locates himself in the scene, is one of VR’s greatest strengths, but also one of the biggest challenges for storytelling in VR. If the viewer feels part of the scene, his role also needs to be considered in the story [1].

As the viewer might feel and react as in a real-world situation, this can have a strong impact on his perception and interpretation of the VR scene. This is reflected, for example, by the effect of proxemics in VR: depending on the zone of interpersonal distance in which a character is placed around the camera, there are direct emotional effects on the viewer [7]. Regarding proxemics, Pope recommended the use of theatre techniques for staging in VR [7].

Furthermore, when feeling presence, a viewer may expect more agency in the virtual environment. If this agency is restricted, as in a non-interactive 360-degree film, it can have a negative effect on the feeling of presence [8]. On the other hand, if agency is given, e.g. in an interactive VR experience, it can become difficult for the storyteller to maintain the narrative structure. In game studies it is a much-discussed issue that high viewer/user agency and a pre-scripted narrative can contradict each other. Ruth Aylett coined the term “narrative paradox” for this issue [9].

3 Guidelines for Cinematic Narration in VR

Based on the research of related work and preceding studies, we approached solutions for our six initial challenges:

3.1 Guiding the Viewers’ Attention to the Relevant Story Elements

A filmmaker cannot know for sure where a viewer is going to look. Generally, the VR experience should give the viewer the freedom to choose his viewing direction. Forcing the viewer’s attention to a specific element destroys immersion, thus contradicting the great potential of VR. Despite this, a filmmaker needs to make sure that the story is not missed by the viewer. Fortunately, he can influence this by implementing attentional cues that guide the viewer’s gaze to the relevant parts of the scene. There are several methods to guide the viewers’ attention. However, if an attentional cue is to be seen, it must be ensured that it is within the viewer’s field of view (FOV); a simple check for this is sketched at the end of this subsection. Combining different attentional cues and even distributing a surplus of cues in the scene can increase the chances of guiding the viewers’ attention. In the user study, we focused on the following diegetic attentional cues:

  • Gazes. Faces attract our attention. Therefore, glances of the characters can work as attentional cues. Gazes of one or more characters in a specific direction are most likely to be followed. In this way they are useful for guiding the viewer’s attention to a place beyond his current FOV.

  • Motion. Motion in a scene strongly attracts the viewers’ attention. Less motion in the scene leads to better recognition of individually moving objects. Motion is particularly efficient at attracting attention in the peripheral vision. If the camera is moving, chances are high that the focus of attention will stay in the direction of movement.

  • Sound. 3D sound is highly effective for guiding the viewer’s attention. Mono sound sources can make the viewer search for the sound source. Sound is especially effective when combined with visual cues.

  • Context. The story itself can affect the expectations of the viewer and thus also his gaze. If there is anticipation that something is about to occur with a certain scene element, then it will most likely attract and keep the viewer’s attention. By arousing and/or subverting expectations, context-related cues can create suspense.

  • Perspective. The space and perspective can guide the viewer’s attention. Just as in paintings, parallel lines are usually followed with the gaze to a vanishing point. Viewers usually try to capture large, salient objects as a whole. Objects very close to the viewer generally attract more attention than similar objects that are further away.

In the scope of this study, we did not examine all possible cues. Other important cues are contrast, lighting, signs and signals [1].
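The FOV condition mentioned above can be expressed as a simple angular test. The following minimal sketch is our own illustration (function names and the flat, yaw-only model are assumptions); it checks whether a cue’s horizontal direction falls inside an HMD FOV of roughly 110 degrees, the value discussed in Sect. 3.5.

    def angle_to(viewing_yaw: float, poi_yaw: float) -> float:
        # Smallest signed horizontal angle (degrees) from the viewing direction to a POI.
        return (poi_yaw - viewing_yaw + 180.0) % 360.0 - 180.0

    def cue_in_fov(viewing_yaw: float, cue_yaw: float, fov: float = 110.0) -> bool:
        # True if a cue's horizontal direction lies inside the viewer's FOV.
        return abs(angle_to(viewing_yaw, cue_yaw)) <= fov / 2.0

    # A viewer looking along the IVD (0 degrees) with a 110-degree HMD FOV:
    print(cue_in_fov(0.0, 40.0))   # True:  a gaze or motion cue here can be seen directly
    print(cue_in_fov(0.0, 170.0))  # False: needs sound or a chain of cues to be noticed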

3.2 The Range of Viewer Roles: Active Participant or Passive Observer?

The role of the viewer has to be considered carefully in every VR and 360-degree film. The viewer takes the position of the camera, so he does not look at the scene from outside but is actually in the scene. The role that a viewer plays in the scene is crucial to his or her experience of it. There are generally two possible situations:

  1. The viewer is only an observer with no connection to the scene

  2. The viewer is part of the scene

The two situations are quite distinct and have a huge impact on the experience. Therefore, a filmmaker needs to define the role of the viewer very clearly. Making the viewer part of the scene utilizes the full potential of VR, although there might be situations in which this is not wanted. Both concepts have pros and cons. Table 1 summarizes the consequences for the experience.

Table 1. Features of active and passive VR experiences

Most of the effects in the left and the right column contradict each other (cf. the narrative paradox). It is important to consider the advantages and disadvantages that come with the two different concepts. The strongest argument for making the viewer part of the scene is the higher probability that a feeling of presence will occur.

3.3 Placing the Action and Story Elements

The positioning of the action is very important. To exploit the whole potential of the 360-degree space, action could be placed all over the scene. However, it needs to be considered that many viewers watch 360-degree movies while seated. The online survey showed that VR in general is expected to be as interactive as possible, but when it comes to (VR) film, people still seem to see it as a rather passive medium in which they sit back and consume, just as they are accustomed to with film as a classical lean-back medium.

Therefore, one should attempt to place primary story elements in the front of the scene, in relation to the initial viewing direction (IVD) or to the seating alignment respectively. The test results showed that the IVD is accepted as the “correct” viewing direction for both sitting and standing viewers, and the action is usually anticipated to begin there. Hence, the attention usually goes back to the IVD after the orientation phase, except when an attentional cue leads to a potential point of interest (POI) somewhere else (Fig. 1).

Fig. 1. Extended staging for seated VR-viewers

The figure above illustrates this concept [1]. In the front 180 degrees, story elements can be placed at any time. The closer the elements come to the back (the blind spot), the shorter they should stay there. Hence, the main events relevant to the plot should mostly happen in the front. However, secondary story elements can and should be placed all over the scene. This way, standing viewers can enjoy the full potential of 360-degree spatial storytelling in VR, while seated viewers do not have to crane their necks to follow the plot.
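As a minimal illustration of this staging concept, the following sketch classifies a story element’s horizontal placement relative to the IVD. It is our own; the intermediate 150-degree threshold is an assumption, as the concept itself only distinguishes the front 180 degrees from the blind spot.

    def staging_zone(angle_from_ivd: float) -> str:
        # Classify a story element's horizontal placement relative to the IVD.
        a = abs((angle_from_ivd + 180.0) % 360.0 - 180.0)  # fold to 0..180 degrees
        if a <= 90.0:
            return "front"       # primary story elements may stay here at any time
        elif a <= 150.0:
            return "transition"  # primary elements should only pass through briefly
        else:
            return "blind spot"  # reserve for secondary elements or deliberate surprises

    print(staging_zone(30.0))    # front
    print(staging_zone(175.0))   # blind spot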

This concept is a compromise in favor of the seated viewer. It limits the creative possibilities in some situations. If a filmmaker wants people to experience his film while standing, for example because the narrative requires the viewer to move around in the scene, it is recommended to give this information to the viewer prior to the experience.

However, placing story elements in the blind spot can be used narratively, for example to surprise the viewer with elements he has not seen before. Also, for standing viewers it can make sense to start the experience facing one direction and to begin the action in the opposite direction. The viewers have then already seen everything that is located in the starting direction and are less likely to return to it. This consequently increases their focus on the story elements in the new direction.

3.4 The Balance of Spatial and Temporal Story Density

In game design, environmental storytelling is a common term for a narrative that unfolds in the space of the scene, rather than being presented to the viewer in a linear fashion through time. Referring to VR experiences, Unseld coined the term spatial story density [11] for the number of story elements that are arranged in the space of a scene simultaneously. We contrast Unseld’s spatial story density with the temporal story density, or simply the pace of the scene. We assume that a viewer can only follow a narrative sufficiently when temporal and spatial story density are aligned with each other.

Figure 2 illustrates this relationship: The blue bars represent primary story elements. It is assumed that all elements must be comprehended to understand the plot. The orange line represents the mental effort required by the viewer to comprehend the plot. With high temporal story density, narratives are fast-paced; with high spatial story density, many narrative strands happen simultaneously. Mental effort increases when temporal or spatial story density is high. In the scene shown in Fig. 2, the viewers are given more time to capture the story elements whenever several of them happen simultaneously; temporal and spatial story density are thus kept separate.

Fig. 2. Schematic representation of temporal and spatial story density during a VR film. The blue bars represent story elements with any visual or auditory information, the orange line represents the viewer’s mental effort. (Color figure online)

Figure 3 shows the effect of excessive concurrent temporal and spatial storytelling during a moment of a scene. It becomes impossible for the viewer to keep track of everything happening around him, the mental effort exceeds a certain threshold, and the scene simply overwhelms the viewer. Consequently, the viewer is confused, misses some of the narrative elements and can no longer comprehend the plot.

Fig. 3. Mental effort accumulates and can exceed a threshold when spatial and temporal story density is too high.

Therefore, it is very important to design the narration with a balanced spatial and temporal story density, so that the viewer is neither bored nor overwhelmed by sensory overload.
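A rough way to make this balance explicit is to model the viewer’s mental effort as the sum of all story elements active at a given moment and to flag moments where it exceeds a threshold, as in Fig. 3. The sketch below is purely illustrative; the effort values and the threshold are assumptions, not quantities measured in this study.

    from typing import List, Tuple

    # (start_s, end_s, effort): when a story element is active and how demanding it is.
    StoryElement = Tuple[float, float, float]

    def mental_effort(elements: List[StoryElement], t: float) -> float:
        # Sum the effort of all story elements active at time t (the orange line in Fig. 2).
        return sum(e for start, end, e in elements if start <= t < end)

    def overload_times(elements: List[StoryElement], duration: float,
                       threshold: float = 1.0, step: float = 0.5) -> List[float]:
        # Sample the timeline and return the moments where effort exceeds the threshold.
        n = int(duration / step)
        return [i * step for i in range(n) if mental_effort(elements, i * step) > threshold]

    # Two primary elements overlapping between t=5 and t=10 push the viewer over the
    # threshold, as in Fig. 3; separating them in time (Fig. 2) would avoid this.
    scene = [(0.0, 10.0, 0.6), (5.0, 12.0, 0.6)]
    print(overload_times(scene, duration=15.0))  # [5.0, 5.5, ..., 9.5]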

3.5 Rethinking Framing

VR covers 360 degrees of the scene, so typical framing as in conventional cinema is no longer possible. The framing is determined by the FOV and the viewing direction. The latter is constantly changing with the viewer’s head movement. The FOV is naturally confined by the human field of vision, but current HMDs further limit it to around 110 degrees. This corresponds approximately to the binocular field of view in which we can perceive depth by means of stereopsis.

In VR, framing and camera position correlate with each other even more than in conventional film. If the mise-en-scène cannot be changed flexibly around the camera, the choice of the camera position might be the only way to compose the image. In order to create a pleasing experience, it is advisable to stage the action around the camera. The influence of the distance between an object and the viewer/camera should always be considered. If a character in VR comes very close to the camera, this cannot be considered a simple magnification, like a close-up. Instead, it has a direct effect on the viewer’s experience. A filmmaker should be aware of proxemics when creating the experience. Closeness of virtual characters could, for example, frighten the viewer or make him feel empathy for them. Whether the effect will be positive or negative usually depends on the context of the scene.
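To make the proxemics argument concrete, the following sketch classifies a virtual character’s distance from the camera into Hall’s commonly cited interpersonal zones. The thresholds are Hall’s standard values, not measurements from this study, and the comments only suggest typical effects in VR.

    def proxemic_zone(distance_m: float) -> str:
        # Hall's interpersonal distance zones, applied to a character's distance from the camera.
        if distance_m < 0.45:
            return "intimate"  # can feel threatening or intensely emotional in VR
        elif distance_m < 1.2:
            return "personal"  # suited to direct address of the viewer
        elif distance_m < 3.6:
            return "social"    # neutral, conversational staging
        else:
            return "public"    # the viewer remains a distant observer

    print(proxemic_zone(0.8))  # personal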

3.6 Rethinking Editing

Jumps in space and time break the immersion and the feeling of presence, since such experiences do not exist in reality. Especially if the feeling of presence is intended in a VR film, it is preferable to avoid cuts.

However, there are techniques to guide the viewer more intuitively from one scene to another. Crossfades and fades to black are much safer methods for changing the scene. This way, the viewer can prepare for the change. However, they also have disadvantages, as they slow the pace down and generally indicate a transition between two scenes.

We successfully tested Brillhart’s concept of probabilistic experiential editing (PEE):

The most salient points of interest in two consecutive scenes should be aligned to present the viewer with the next relevant element (Fig. 4): Element 1 is moving through the scene from left to right. The editor assumes that the viewer will follow element 1 with his gaze. After the cut to scene 2, the viewer is directly looking at element 2.

Fig. 4. Probabilistic experiential editing, according to J. Brillhart

Although element 2 is presented to the viewer, he might still need to orient himself in the new scene. Therefore, action relevant to the plot should not start immediately after the cut, even if the viewer already knows where the action will most likely begin.

Generally, orientation views weaken the initial attention to the POIs. Strong POIs that are closer to the viewer or in which an action is anticipated can suppress the orientation views to a certain degree. In order to increase the probability that a viewer looks at an in-point at the beginning of a shot, several in- and out-points can be placed in the shot. However, there should preferably be fewer out-points than in-points.
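The core alignment step of PEE can be sketched as a simple yaw offset: the next shot is rotated so that its most probable in-point appears where the viewer’s gaze is expected to rest at the moment of the cut. This is our own minimal illustration of the idea, not Brillhart’s actual tooling.

    def pee_rotation(expected_gaze_yaw: float, in_point_yaw: float) -> float:
        # Yaw offset (degrees) to apply to the next shot so that its in-point
        # appears where the viewer is probably looking at the moment of the cut.
        return (expected_gaze_yaw - in_point_yaw) % 360.0

    # The out-point (element 1) is expected to carry the gaze to 30 degrees right of the
    # IVD; the next shot's main in-point (element 2) sits at 120 degrees, so the shot is
    # pre-rotated by 270 degrees (i.e. -90 degrees) before the cut.
    print(pee_rotation(30.0, 120.0))  # 270.0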

Since 2017, VR filmmakers have begun to use hard cuts more courageously. It can be assumed that people are becoming more accustomed to cuts in VR and consider them less distracting and immersion-breaking. In the end, this is also a matter of viewing habits.

4 User Study

In order to verify the numerous assertions made in the previous section, we conducted an empirical viewer test with 50 test participants (TPs) and an additional online survey with 88 online participants (OPs). The TPs were 62% male and 38% female, and 64% of them were between 18 and 29 years old.

In order to additionally address more VR-experienced users, the online survey was posted in VR-related social media groups. The average age of the OPs was higher (51% between 30 and 49 years). The objective of the tests and survey was to clarify essential questions concerning the attention, perception and comprehension of viewers during VR films, and thereby determine approaches to the afore-mentioned challenges of VR filmmaking. More precisely, this included methods of guiding the attention with sound, visual and context-related cues; the balance of spatial and temporal story density; the importance of the orientation phase and the IVD; effects of PEE; effects of dolly shots; differences between seated and standing experiences; and story comprehension in scenes with multiple POIs.

As this research took a rather comprehensive approach to cinematic narration in VR, the aim was to integrate several issues into the test design. Furthermore, statistical data was collected to put the expected and unexpected results into a broader context.

4.1 Implementation

The main test consisted of six consecutive 360-degree videos, lasting seven minutes altogether. TPs experienced these on the head-mounted display (HMD) HTC Vive. The videos were chosen to cover a number of different situations and to test the related assumptions. The test video included only fragments of original videos that were freely accessible on YouTube.

The test had three parts. In the first part, viewers answered some general statistical questions. In the second part, they watched the videos. In the third part, they answered questions about their experience and their comprehension of the scenes. The whole test took approximately 20 to 30 min per participant.

While the viewers watched the videos, their FOV could be followed on a computer screen and the sound could be followed on separate headphones. This way, their attention and reactions to visual and auditory cues were monitored during the whole test. The viewing direction could indicate the attention that a viewer paid to a specific POI. As the viewers’ eye movements were not tracked during the test, the FOV only gave an approximate measurement of the viewer’s attention. However, research has shown that the head pose follows the eyes’ gaze direction with only a minor delay and is hence a reasonable indicator of visual attention [10]. Especially in scenes in which different strong potential POIs could not be seen simultaneously in the FOV of the HMD, the estimation of the viewer’s attention could be expected to be relatively precise.

The participants were divided into two groups. Group A was asked to sit on a simple chair with backrest (no swivel chair) during the first two videos and then to stand up for the remaining videos. Group B was standing from the beginning.

First, 37 true/false conditions were defined (e.g. a specific cue attracted attention or did not). The TPs’ FOV was then observed and the results of these conditions were recorded.

These were then evaluated together with the answers to the 23 questions which were asked (Table 2 and Fig. 5).
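As an illustration of how such condition records can be aggregated, the following sketch (condition names and data are hypothetical) computes the share of participants for whom a condition was observed as true, i.e. the kind of percentage plotted in Figs. 6 and 7.

    from typing import Dict

    # results[participant_id][condition] = True/False, recorded while observing each TP's FOV.
    Results = Dict[str, Dict[str, bool]]

    def positive_rate(results: Results, condition: str) -> float:
        # Share of participants (in percent) for whom the condition was recorded as true.
        values = [r[condition] for r in results.values() if condition in r]
        return 100.0 * sum(values) / len(values) if values else 0.0

    results = {
        "TP01": {"followed_gaze_cue": True,  "turned_to_emperor": False},
        "TP02": {"followed_gaze_cue": True,  "turned_to_emperor": True},
        "TP03": {"followed_gaze_cue": False, "turned_to_emperor": False},
    }
    print(round(positive_rate(results, "followed_gaze_cue"), 1))  # 66.7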

Table 2. The test video contained clips from these 360-degree films
Fig. 5. Screenshots of the 360-degree films used in the test video. No. 1: video 1, no. 2+3: video 2, no. 4: video 3, no. 5+6: video 4, no. 7: video 5, no. 8: video 6

4.2 Results

Seated and Standing Positions, the Initial Viewing Direction, and Their Effects on the Viewers’ Attention

For VIDEOS 1 and 2, the viewing direction and behavior in seated and standing positions were compared. The test results showed that the attention on a POI behind the viewer was 40% lower when viewers were sitting. In VIDEO 1, fewer of the seated participants looked around when they heard the voice of a woman standing next to the camera, while the standing participants tried to locate the voice with their gaze. Once they had spotted her, standing participants were more than twice as likely to follow the woman’s position and keep their attention on her (Figs. 6 and 7).

Fig. 6. FOV observation with divided acquisition for groups A and B. The conditions on the y-axis were defined as true or false. The x-axis shows the percentage of positive results for each condition

Fig. 7. FOV observation with undivided acquisition. The conditions on the y-axis were defined as true or false. The x-axis shows the percentage of positive results for each condition

Almost every TP looked up at the cathedral in front of the IVD. It seems that viewers attempt to capture structures that do not fit in the FOV as a whole. This effect can be used to guide the gaze through a scene.

In the intro of VIDEO 2 (a wooden elevator into the arena), the FOV of both groups was mainly in the front 180 degrees relative to the IVD. While all sitting participants were looking primarily to the front, this applied to only 72% of the standing participants. Even the standing participants who turned around eventually turned back to their IVD, even though all sides of the elevator looked exactly the same. Both groups of participants expected the IVD to be the “correct” direction in which the filmmaker had oriented them. Therefore, action is anticipated in front of the IVD, at least until attentional cues prove otherwise. After the gladiators faded in, it became apparent that participants of group B changed their view to the back while viewers of group A mostly looked to the front. For a third of all participants a predominant viewing direction was not recognizable. At the climax of the video, when the emperor appeared at the tribune, far fewer sitting than standing viewers turned their heads to the emperor or had already looked in his direction. So even the strong endogenous attentional cue given by the gladiators (gaze and salutation) could not make half of group A look in his direction.

While these results were generally expected, their distinctness was still surprising. Especially in the moment when the emperor appears, it was assumed that more seated viewers would have looked at him, despite the inconvenience of craning their neck. Most of test group A perceived the seated position as uncomfortable when they wanted to look around. Consequently, only 12% of this group preferred the seated position during the experience.

However, the first two videos were multiple-POI scenes that required the viewer to look around. In contrast, most of the VR films produced today do not demand that the viewer turn to the back, so sitting in a chair would not restrict him in any way. Consequently, the OPs, who had more experience with VR films, gave different answers: only 12.6% preferred to watch VR films standing, compared to 29.9% who preferred sitting. For 56.3% of the OPs it depended on the film.

Slightly more TPs of group A disagreed (20%) or partially disagreed (36%) that the action should mainly happen in front of the viewer. Potentially the inconvenience of the seated position was not the main reason that they preferred to stand. Instead, the wish to explore the whole 360-degree space in a VR experience could be a motivation to favor a standing position, since it allows much more freedom to move. Of the OPs, 17.4% agreed and 58% partially agreed that the action should be in front of the viewer, so that he does not need to turn around.

Viewers’ Behavior with Probabilistic Experiential Editing (PEE)

In VIDEOS 3 and 4 we observed the effects of hard cuts and PEE and the orientation phase after each cut. There are six hard cuts in VIDEO 3. The shot between the first and the second cut is 18 s long, whereas the other shots are 8–10 s long. The PEE of a cut was considered “successful” when one of the in-points was in the center of the viewer’s FOV after the cut. At the second cut, which followed the longest shot, 84% of the viewers were looking at the only POI of the scene at the moment of the cut, which was the best result.

For almost half of the viewers, orientation views significantly reduced the attention to the POIs. It was very obvious that viewers needed an orientation phase after every cut. Just after a cut, the viewers’ attention stayed on the POI only for a moment before they began to look around. When no other POIs were found and the whole scene had been explored, viewers turned their heads back to the POI from the beginning. This worked best in shot 2, which had only one salient POI in the scene and gave viewers enough time to look around. Most viewers needed around 15 s before they moved their attention back to the POI, which was just the right timing for the cut to the next shot. In the other, shorter shots, viewers did not have enough time for this orientation phase and therefore often did not look at the intended out-point at the moment of the cut. Nevertheless, shot 4 worked well, probably because the POI was relatively near and a possible interaction between the two people attracted more attention. Viewers were therefore more likely to suppress their orientation views in order not to miss an anticipated action. Hence the cut worked quite well, even though the shot was much shorter than the second one. Cut 5 still worked for 74% of the viewers, even though the preceding shot was short and the POIs were not very close to the viewer. This can be explained by two probable out-points (two persons) and three in-points (three persons); the multiple in- and out-points increased the probability of this cut working.

When interviewed after finishing the test, most of the viewers remembered the hard cuts in VIDEO 4 (78%) and VIDEO 3 (59%). The recognition of the cuts in VIDEO 3 was surprisingly low, as the video contained the first hard cut of the experience. For at least two TPs this cut caused noticeable reactions of confusion, which might have been an indication of broken immersion. However, the low recognition rather indicates that the cuts were not perceived as negatively as expected.

84% of all respondents agreed or partially agreed that hard cuts have a negative effect on the VR experience; only 5% disagreed and 14% partially disagreed with that statement. There was no significant difference between the answers of TPs and OPs. That means that hard cuts generally seem to have a rather negative effect on the experience, even when PEE is applied. Although PEE can make editing in VR much more pleasant for the viewer, the cuts are still perceived as immersion-breaking.

Attention During Orientation Phase

VIDEO 5 tested the viewer’s attention on a POI with and without an orientation phase. To this end, we created different versions for each test group: test group A saw the full length of the video, while for group B the first 17 s were cut out completely and the IVD was turned by 180 degrees. This way, the viewers of group B saw King Louie right at the beginning, whereupon he immediately began to speak to them.

In effect, group A paid more attention to King Louie than test group B, which did not have the orientation phase. Group B looked around while the giant ape was already speaking to them.

However, King Louie was still the strongest POI in the scene and therefore attracted most of the attention of all viewers. Nevertheless, the lack of time to orient in the scene clearly weakened the attention on the POI, which confirms the results of the previous videos. Judging from the viewers’ attention, the 17 s orientation phase seemed to be sufficient time to explore the scene and made it easier to keep the focus on the action afterwards. The orientation time that viewers needed in VIDEO 4 essentially confirms this result.

Still, the necessary orientation phase depends on the complexity of the scene and on the viewer himself. For the first VR film of Oculus Story Studio, “Lost” (2015), director Saschka Unseld measured an optimal orientation phase of 40 s [11]. Although he referred to “Lost”, it can be compared quite well with VIDEO 4, as both videos have no salient POIs during the orientation phase.

The gap between Unseld’s estimate and the test results could be partially explained by the increased VR experience of the test participants, as participants with a lot of VR experience proved to need less time for orientation. Especially TPs who had had more than 30 VR experiences before needed only a very short orientation phase. They turned their heads around very quickly in the beginning, capturing the whole scene in just a few seconds. These more experienced viewers usually turned their heads back to the IVD after only 5 to 10 s and waited for the action to begin. This was a good indication that they had finished their orientation phase and were ready to concentrate on the action. Less experienced viewers moved their heads much more slowly and therefore needed more time for the orientation phase, or did not turn around at all.

In group A, whose IVD faced in the opposite direction to King Louie, 92% of the TPs mostly kept the IVD until they heard his voice; consequently, 88% did not see him before he began to speak. When they heard his voice, most of them began to look for its source. Except for one participant, all viewers in group A found King Louie after a few seconds and turned around with their whole body, accepting the new direction with the primary POI as the new “front” direction. These results again confirm the assumptions regarding the preferred orientation towards the IVD as well as the importance of the orientation phase. They also show that mono sound sources that attract the viewer’s auditory attention make him look around for the sound source in most cases. With 3D sound, however, the viewer would most likely have found King Louie much more quickly, as his voice could have been localized in the scene immediately.

Gaze Interpretation

The second difference in VIDEO 5 for group A was the muted first sentence of Balu when the bear entered the scene behind the viewer. This way, only King Louie’s angry gaze guided the viewer’s attention to Balu. Group B instead heard Balu’s first sentence simultaneously with the change of King Louie’s gaze towards Balu. 96% of group B interpreted the auditory and visual cues correctly and immediately turned around to Balu. In group A, 72% still turned around to Balu, but only because they interpreted and followed King Louie’s gaze to Balu without the auditory cue. However, the reaction of group A was significantly slower, with a delay of 3 to 5 s. This seemed to be the time that most participants needed to interpret the endogenous cue of Louie’s gaze. 28% of the participants in group A could not interpret King Louie’s gaze in such a short time and did not turn around before Balu’s voice appeared, which was 7 s after King Louie looked at Balu. Overall, the test results show that gazes can be good endogenous cues but take a long time to be interpreted when they come unexpectedly or out of the context of the scene. Likewise, auditory cues that cannot be localized can make the viewer look for the source of the audio. The combination of mono audio sources and gaze has a very high probability of directing the viewer’s gaze as intended. This demonstrates the effectiveness of combining diverse attentional cues, especially when addressing multiple senses.

Interestingly, VIDEO 5 also illustrated the strong effect of proxemics in VR. The closeness of the giant ape left such a strong impression that when TPs were asked afterwards, they falsely remembered it as stereoscopic. Likewise, the stereoscopic VIDEO 4 had been perceived as monoscopic, since characters were staged further away from the camera. Although this was a purely subjective impression, it seems that the effects of the technical immersion are sometimes overrated.

Attention in Multiple POI-Scenes

VIDEO 6 had a fast-paced narrative with multiple POIs: An alien crashes down to earth like a meteorite. A woman approaches the crater and picks up a device from the ground. Accidentally she shoots at the alien with the device. The alien becomes aggressive, grows, and starts to chase the woman. Here we examined the combination of high temporal and spatial story density and its effects on plot comprehension.

Of the several POIs in the scene, two were important to the plot (the woman and the alien). Interestingly, all viewers focused on a waste bin, which the camera approached in the beginning. As the camera came closer, the bin seemed to be a strong POI even though it had no relevance to the story. It was even more surprising that the fast-moving meteor crashing down on the street was not recognized by some viewers.

After the camera movement stopped, the viewers began to look around to orient themselves in the scene. More of the viewers focused on the alien (46%) than on the woman (14%); the rest of the viewers changed their view between woman and alien at least once. The strong focus on the alien can also be explained by the subtle dolly shot towards the crater, which is why more participants looked at the crater than at the woman. The reactions to all camera movements confirmed that the viewer is most likely to face the direction of movement. At least 40% of the viewers recognized both alien and woman as relevant POIs. However, the most important moment for understanding the plot - when the woman picks up something from the street - was missed by most of the viewers, as they were focusing on the alien at that moment as a much stronger POI. While focusing on the alien, viewers were unable to see the woman, as she was clearly outside the FOV (110 degrees) (Fig. 8).

Fig. 8. The alien attracted more attention, so most of the viewers missed the short moment when the woman picked up the object. (Source: Google Spotlight Story: HELP)

This test proved the importance of carefully planned temporal and spatial storytelling, as well as how the FOV has to be taken into consideration when staging the action. More precisely, it showed that two simultaneous POIs which are relevant to the plot need to be within the FOV of the HMD, as it cannot be expected that viewers will change their view between multiple POIs at the right moment or even at all.

Comprehension of Story

After the test videos, the participants were asked three specific questions on VIDEO 6 to evaluate their comprehension of the plot (Fig. 9).

Fig. 9. Viewers’ comprehension of the plot in VIDEO 6

The viewers who did not look at the woman at the important moment when she picked up the object obviously did not register it. Consequently, not even half of the TPs could answer a question about where the woman got the object from, which results in a poor comprehension of the story and plot. Still, 68% of the viewers recognized that the woman shot a blue beam of light with “something” at the alien and that it became aggressive because of that. The blue beam visually connected the alien and the woman, which finally made the viewers look alternately at the alien and the woman.

In the end, only 20% of all viewers could answer all three questions correctly: “Where did the alien come from? Why did the alien become aggressive? Where did the woman get the object from?” 8% of the viewers could not answer any of those questions. In this sense the scene failed, as viewers could not follow the plot. VIDEO 6 is a good example of what can happen when temporal and spatial story density is too high: viewers cannot pay attention to several POIs that occur at the same time or close together in time. However, only 57% of the TPs felt that they had missed anything in the scene.

5 Conclusion

This study verified several methods which have already been used but not yet proven. It also introduced new models, such as the extended staging for seated VR viewers, and examined the relationship between temporal and spatial story density.

In summary, it can be stated that many of the methods which have already been applied intuitively by VR filmmakers, or that have been developed through several iterations, can be confirmed as reasonable and successful. The results of the previous chapters might be used as guidelines for VR filmmakers to help them deal with the challenges of cinematic narration in VR. Without doubt, further research is necessary to get a deeper understanding of each particular challenge that comes with VR filmmaking. Furthermore, we recognize that, as usage increases, not only is the production of VR content evolving, but so also is the viewers’ use and perception of this content.

Therefore, future work on this topic might develop its ideas further, increasingly independently of the context of established media formats.

In a direct comparison of established filmic techniques with the possibilities of VR film, the limitations of VR are still evident. At the same time, there is a plethora of new possibilities and techniques for telling stories in VR. It seems that, from an artistic point of view, we do not necessarily need to find a replacement for every established filmic technique. Instead, VR filmmakers should also try to discover a completely new approach to narration in VR films. Ultimately, the real challenge lies not in finding a new way of editing, framing, or other established filmic tools, but in the creation of an unforgettable experience that lets the viewer immerse himself in a story and its virtual world.