Abstract
Empirical study designs in HCI evolve in response to temporal realities and technological advancements. In this context, virtual reality (VR) shows potential for new empirical research designs that go beyond HCI's still largely lab-based research roots. Previous work has explored the use of VR for gathering diverse and representative sample populations and for conducting empirical studies in resource-constrained environments. Yet, it is unclear how VR empirical user study designs affect participants' behavior and experience, potentially influencing study results compared to in-situ/in-lab studies. In this paper, we conducted a gesture elicitation study (GES) in a realistic physical smart room and in its digital duplicate in VR. Sixty-six participants' responses were collected using standardized questionnaires, along with a between-group gesture agreement analysis. Our comparison shows that the VR study produces a higher number of unique gesture proposals and similar best gestures to those of the in-person study for 95.4% of the referents, with minimal influence on the gesture proposals. We further discuss the usability, pragmatic and hedonic qualities, presence, task load, and implications of using VR for GESs, and highlight future directions for VR-based empirical study designs. We found that VR can produce reliable data and improve participant experience with the same task load, making it a viable option for conducting remote GESs and a substitute for conventional lab-based experiments.
1 Introduction
Empirical studies are fundamental to Human-Computer Interaction (HCI) research and interaction design (Norman and Draper 1986). Researchers and designers utilize empirical studies to observe and analyze human behavior (Wobbrock et al. 2009), to evaluate prototypes and identify possible future improvements (Salovaara et al. 2017), to evaluate user experience (Bargas-Avila and Hornbæk 2011), etc. These studies use methods from various domains, such as surveys, interviews, in-lab studies, fieldwork/in-situ studies, and probes. Nevertheless, Koeman's (2020) review shows that in-lab studies are by far the most popular method, with 42.5% of HCI studies using this approach. One such well-known lab-based empirical study with strong in-lab and in-person design roots is the Gesture Elicitation Study (GES). GESs are utilized in gestural interface design and development. Wobbrock et al. (2005) introduced this empirical study to the field of HCI, and researchers and designers have since used it to collect the input preferences of end-users when designing and developing gesture-controlled interfaces. In the initial experiment design, a researcher/designer invites potential system users to a lab, presents them with a referent (a certain effect of a system), and asks the participants to propose a gesture (known as a symbol) to invoke the given referent. The researcher/designer then triggers the referent, making the participant believe that their performed gesture caused that effect. Here the researcher/designer acts as a wizard; this technique is called Wizard of Oz (WoZ) and became the technique most widely associated with this empirical study design.
Since its inception, this empirical design has been followed by many HCI practitioners across different applications and research areas and has gained popularity. With more than 200 published studies employing the method (Villarreal-Narvaez et al. 2020; Magrofuoco and Vanderdonckt 2019; Villarreal-Narvaez et al. 2024), it is considered a well-utilized empirical design in gestural interaction design. Despite its wide use, this in-lab study design inherits some known drawbacks that are common to most in-lab studies in general. To summarize a few: (1) the in-lab GES design limits the number and diversity of participants, and hence the representativeness of the study results; in general, symbols elicited from a large and diverse group tend to be preferable to those from smaller samples (Morris et al. 2010); (2) studies are generally conducted in series; (3) the researcher/designer and the participant must be available simultaneously; (4) there are logistical challenges, e.g., developing and maintaining the lab setup; (5) reproducing the study or adding new samples (participants) requires setting up the lab again; and, in some contexts, the Hawthorne effect and reduced ecological validity of the setup also impact the elicited proposals. In addition, with the recent pandemic, HCI research was challenged as labs were closed and in-person experiments were restricted due to safety concerns, making in-lab GESs extremely difficult to conduct. Thus, researchers have proposed amendments, enhancements, and alternatives to this empirical study design to overcome these challenges and to fit the study into different contexts, applications, and use cases. As a result, the focus on remote and encapsulated GESs has increased. Researchers started exploring new empirical designs and tools such as Gelicit (Magrofuoco and Vanderdonckt 2019) and Crowdlicit (Ali et al. 2019), which are based on web technologies.
While these designs help to conduct remote GESs, they have limitations, such as not being able to create ecological validity for use cases like smart homes, car infotainment system interaction, etc. Recently, advances in technologies such as Augmented Reality (AR) and Virtual Reality (VR), and their growing consumer reach, have led researchers to further investigate addressing these challenges using these emerging technologies.
Previous work by Voit et al. (2019) shows that VR closely resembles in-situ studies and has the potential to act as a proxy for real-world experiments that are either difficult to conduct or have drawbacks in their in-lab designs. Further, Mottelson and Hornbæk's (2017) comparison of in-lab VR and remote VR experiments shows that there were no significant differences between the effects of the experimental conditions. Recently, VR has increased its reach to consumers, enabling researchers to reach a wider audience, and the majority of VR Head Mounted Displays (HMDs) have very good hand-tracking capabilities, allowing researchers to conduct gesture elicitation and bring ecological validity (Kourtesis et al. 2021; Perera et al. 2021) to their experiments. VR-GES, proposed by Perera et al. (2021), and Welicit, by Bellucci et al. (2021), demonstrate successful designs for conducting GESs using VR. While this literature shows a promising direction towards overcoming the challenges faced by the initial GES design, it is important to identify what effect VR might have, specifically for behavioral studies such as GESs, when used as a proxy for the real world.
In GESs, designers and researchers invest significant effort in conducting the study and identifying user-defined gestures (Magrofuoco and Vanderdonckt 2019). Thus, it is vital to pay attention to the experiment design in advance, aiming for reliable results. Among the several new empirical designs proposed to address the challenges of existing in-lab GESs, it is common that these different empirical study methods entail their own unique advantages and disadvantages, which might affect the results of a study. For example, a web-based method may use online surveys, photos, and videos to present referents, while lab studies can use real devices. Further, lab studies may always be supervised by the investigator, whereas in a remote setting this may not be feasible. In a VR setup, users may immerse themselves well in the environment, yet in the lab, without ecological validity, participants' proposals may differ. In particular, the process of discovering gestures must occur naturally or in a close-to-real system (Henschke 2020). If techniques such as WoZ are used, then preserving the illusion of system autonomy is paramount (Henschke 2020). Compared to the in-lab design, VR-based GES designs may approach these challenges differently. Nonetheless, while these new methods aim to overcome drawbacks in the current elicitation study design, it is important to investigate the effect of these new methods when attempting to make GESs remote and encapsulated.
Comparisons of different methods such as online surveys, in-lab, and remote studies have been investigated in previous work (Nielsen et al. 2006; Rogers et al. 2007; Mottelson and Hornbæk 2017; Kjeldskov et al. 2004). However, these comparisons reveal both similar and different results depending on the study type and context. This shows that different methods and their contexts can affect study outcomes, which could have implications for empirical research and the interpretation of scientific work. Therefore, it is important to conduct study-specific investigations to determine the suitability of empirical designs. To date, to the best of our knowledge, no studies have compared the effect of using VR for behavioral studies such as GESs. Therefore, while researchers have attempted to use modified designs for GESs, we do not know how these new empirical designs themselves affect the results compared to established methods such as developing an actual physical or close-to-real setup and conducting elicitation studies there. Whether a study conducted with the same set of referents for the same environment will still produce the same results remains an open question.
In this work, we investigate the effects of using VR as a medium for GESs. We were mainly interested in the differences and similarities that participants experience when VR is used as a medium to conduct GESs for different use cases. For instance, to study smart home gesture interactions, should researchers develop a smart home setup and conduct a GES, or can VR be used as a medium, and if so, what should they expect? Will the results be the same, worse, or better? Following previous works that compare different methods focusing on usability (Duh et al. 2006; Nielsen et al. 2006; Voit et al. 2019), participant experience (Sun and May 2013; Voit et al. 2019), and the results, we followed a similar procedure to compare a VR GES and a real-world in-lab GES. For our comparison, we developed a real-world smart room and its digital clone in VR. In the real-world in-lab GES, we used actual devices and developed a WoZ interface to control them remotely to provide the illusion of autonomy. The digital clone was created with the same functionality in the VR GES. We then conducted two GESs and a between-group analysis to see how these two empirical designs affect the results of the study. We investigated the usability of the empirical designs and participant experiences, and compared the results of the two methods.
The remainder of the paper is structured as follows. In Sect. 2, we review existing in-lab GES designs and how empirical method comparisons have been conducted in previous work. In Sect. 3, we illustrate our comparison study approach and the composition of the in-lab and VR GESs. Finally, we present the results and our findings in Sect. 4, discuss the challenges and opportunities in Sect. 5, and present our findings and suggestions and conclude in Sect. 8.
2 Related work
Our work is inspired by previous work that investigates the use of VR for HCI studies, including its potential to overcome the challenges imposed during the pandemic. In this section, we look at GESs that use the initial empirical design and their characteristics, and at the characteristics of VR GESs reported in previous studies. We then look at how empirical method comparisons are conducted in the literature and follow similar approaches in comparing the GESs conducted using the initial design and the VR design.
2.1 GESs in lab
Many researchers have utilized Wobbrock et al.'s (2005) original guessability method, published in 2005, to design GESs that elicit interactions for different types of applications and domains. Starting from eliciting gestures for tabletops in surface computing (Wobbrock et al. 2009; Morris et al. 2010), GES usage spread to vehicle controls and infotainment systems (May et al. 2017; Fariman et al. 2016), smartphones (Ruiz et al. 2011), combinations of devices such as smartphones and public displays (Kray et al. 2010; Nebeling et al. 2014), smart appliances such as televisions (Vatavu and Zaiti 2014; Wu et al. 2016; Dong et al. 2015), smart environments (Vogiatzidakis and Koutsabasis 2019; Hoffmann et al. 2019; Choi et al. 2014; Vogiatzidakis and Koutsabasis 2022), wearables (Gheran et al. 2018), and AR (Williams et al. 2020; Piumsomboon et al. 2013; Pham et al. 2018).
In a GES, users, their tasks, platforms, devices, and environments are the key dimensions (Magrofuoco and Vanderdonckt 2019) considered when planning and designing the empirical study. Researchers may alter the study design according to the context of the application/research; yet, the majority of GESs share common characteristics: the simultaneous, in-person presence of the experimenter and the participants within a lab setting; the use of single or multiple cameras to capture gesture proposals; the use of either symbolic inputs or real equipment/devices; the execution of the experiment in series; direct, real-time observation of the participant; and the use of the WoZ technique. Figure 1 shows GES setups used in three different studies, which depict the said characteristics and follow the initial GES design. While some of these characteristics may lead to several drawbacks, the in-lab study method has been used for many applications and research areas, as stated above. Hence, in our comparison study, we developed a GES following these initial empirical design guidelines and characteristics for the purpose of comparing it with the GES design that uses VR as a medium.
It is important to note that GESs are sometimes conducted considering the whole environment, for example, a smart home as shown in Fig. 1, a vehicle setup (Döring et al. 2011), or a public setting (Kray et al. 2010). The environment in which gestures are collected also plays a crucial role (Magrofuoco and Vanderdonckt 2019) in the outcomes of the study. Hollan et al.'s (2000) "Distributed Cognition" framework for viewing participants in their environment shows that people are not passive actors in their environment and that a system impacts how an actor uses it. For instance, in Vogiatzidakis et al.'s (2022) study, the researchers prototyped spatial AR systems using projection mapping with foil mockups of devices to evaluate smart home interaction in a GES. This acknowledges the importance of ecological validity in the study setup for this type of GES, which requires considering the whole environment. However, due to the complexity of the setup in Vogiatzidakis et al.'s study, they used an in-lab setup only for evaluation, as conducting remote studies with this method would be challenging. When eliciting participants' input proposals, the environment setup is crucial for uncovering nuanced interaction information. Therefore, when we designed the two empirical designs for our comparison, we paid special attention to maintaining similar environments in both.
2.2 GESs outside the lab
Due to the drawbacks observed in the initial GES empirical design, researchers have come up with several alternative designs to overcome them. For instance, to bring GESs out of the lab and conduct them remotely to acquire a diverse set of participants, a range of designs was proposed. Among them are a few studies that have used web-based tools to interact with participants, such as Amazon Mechanical Turk (AMT) (Madapana and Wachs 2019), video conferencing tools, and web applications, where the investigators use techniques such as Wizard of Oz (WoZ) and think-aloud protocols. Magrofuoco and Vanderdonckt's (2019) Gelicit and Ali et al.'s (2019) Crowdlicit are two such empirical designs that attempt to conduct GESs remotely using web-based tools. Both were published at similar times, both were claimed to be the first of their kind, and they share some similarities in terms of design and purpose. Gelicit (Magrofuoco and Vanderdonckt 2019) is a cloud-computing-based design for conducting a GES with distributed stakeholders. Gelicit supports the most widely used referent presentation mechanisms, such as images, GIF animations, and videos. However, Gelicit focuses on eliciting gestures for graphical user interfaces only, and its gesture-capturing capabilities are limited by the intrinsic limitations of HTML5. Also, Gelicit does not focus on creating an ecologically valid setup during the elicitation, especially in situations like driving or smart home setups. Similarly, Crowdlicit (Ali et al. 2019) also proposes a web-based design for distributed elicitation and identification studies. For referent presentation, Crowdlicit facilitates text, audio clips, images, and videos only. Again, this makes it less suitable for scenarios where the environment is important (e.g., elicitation for interaction designs in vehicles, smart homes, public displays, wearables, etc.).
In addition to video recording of the participants' gesture proposals, Crowdlicit allows text and image recordings. However, the use of video recordings to capture proposed gestures may carry over concerns from the initial GES design, such as participant privacy, a potential Hawthorne effect, and video interruptions due to performed gestures obstructing the recording camera. In the Crowdlicit design, the authors claim that it can address two legacy bias mitigation techniques (prime and production) proposed by Morris et al. (2014), which were not discussed in the Gelicit design. Further, Crowdlicit facilitates unsupervised GESs as well. Both Gelicit and Crowdlicit cut down on the resources required to conduct elicitation studies, which addresses the logistical challenges inherited from the initial GES design. While these new empirical designs address several concerns of the initial GES design, both acknowledge the limitation of not evaluating their results against a lab-based elicitation study and state this as an opportunity for future research, which confirms the necessity of our work.
Chamunorwa et al.'s (2023) recent work presents an interesting direction, seemingly using a methodology inspired by exploratory HCI methods such as cultural probes (Gaver et al. 1999) and technology probes (Hutchinson et al. 2003). In this study, participants were provided with a DIY kit to assemble in their homes, following instructions to record themselves as they proposed gestures using a pillow as the interaction object. This GES was conducted in participants' homes, outside typical lab settings. The study included both moderated and unmoderated GESs, with thirty participants (14 moderated, 16 unmoderated) and 19 referents representing various smart home control actions. Key observations of this study design are that participants had to assemble the observation kit, the referents were presented as text using 'referent cards', the gestures were recorded via video, and participants were encouraged to think aloud as they performed the tasks. This study also reflects the researchers' interest in investigating different methodologies for conducting remote and unsupervised GESs. However, several challenges emerged. Participants' gestures might not have been fully captured due to the fixed camera setup, and it was difficult to ensure adherence to the study protocol in the unmoderated group, which involved a series of instructions that participants needed to follow. Additionally, since there was no automated referent presentation mechanism, more incomplete referent coverage was reported, as participants skipped certain tasks. In terms of reproducibility, the unsupervised nature of the study presents challenges in consistently replicating the experiment's exact conditions. Moreover, making amendments to the study could be complicated due to the logistical challenges of remote kit distribution.
The privacy concerns associated with recording participants in their homes, potentially exposing other household members, must also be carefully considered. Thus, the authors propose further research, particularly in unmoderated settings, to improve the overall rigor and reliability of such remote studies.
Overall, these new GES designs made remote and unsupervised studies feasible. Yet, these purely web-based designs were not capable of addressing the 'environment' dimension of in-lab GESs discussed in Sect. 2.1, i.e., creating the ecologically valid environments required in GESs. Thus, building on the lessons from these web-based designs, researchers started looking at immersive computing such as AR and VR for remote and unsupervised empirical studies in HCI (Ratcliffe et al. 2021a, b).
2.3 Potential of out-of-lab studies with VR
With the rise of VR HMD usage in the consumer market in recent years and its predicted future growth, researchers have been able to reach a wider audience with their experiments. For instance, Ma et al. (2018) highlighted that they were able to obtain a more diverse sample than in lab studies for their behavioral experiments conducted in VR using crowd-sourcing. Steed et al.'s (2016) experiment on presence and embodiment in VR shows that it is generally easier to recruit more participants for remote VR studies than for typical lab-based studies. Further, Mottelson et al.'s (2017) comparison of in- and out-of-lab VR experiments showed that remote VR studies were able to collect reliable data and produce similar outputs, asserting the potential of out-of-lab studies using VR. Additionally, Voit et al. (2019) show, in their comparison of AR, VR, online, and in-situ studies for prototype evaluation, that VR and in-situ produced similar results, which they had not expected beforehand. Thus, they highlight the open question for future work of investigating the effect of using different environments in VR, e.g., "a natural environment compared to a lab setup", and invite further investigation of the differences between empirical studies using VR and in-situ studies.
With such characteristics, a proven record of successful VR experiments, and VR's high resemblance to in-situ studies, we examined the suitability and effect of VR for behavioral studies such as GESs. However, there is a paucity of research into how to effectively conduct remote GESs using VR. Gesture elicitation for AR environments by Nascimento da Silva (2022) is a close attempt, but it focuses only on AR applications (not on using AR as a proxy to conduct GESs in other contexts). Also, it still extracts videos from the HMD, which may again raise privacy concerns, as the participant's home/local environment may be captured in addition to their hands. Perera et al. (2021) and Bellucci et al. (2021) were the first to attempt and discuss the use of VR as a proxy for GESs, introducing two empirical designs named VR-GES and Welicit, respectively. We looked at how VR-GES and Welicit cover the dimensions of a GES design discussed in Sect. 2.1 for the purpose of selecting an empirical design to compare the effect of using VR for GESs against the in-lab GES design. While both the VR-GES and Welicit designs focus on remote GESs, they differ in several respects: how gesture data are captured, the involvement of the experimenter, the ability to use legacy bias mitigation techniques, and the facilitation of the gesture binning process.
According to Bellucci et al. (2021), their design records audio, video, and the VR play of a participant; it has the potential to capture quaternion data and can incorporate other gesture-tracking devices such as Leap Motion. Yet, they do not discuss how their design can be used with the native hand-gesture tracking of VR HMDs such as the Oculus Quest. Further, Bellucci et al. (2021) evaluated the system with an Oculus Rift and an external Leap Motion hand tracker, which makes it difficult for remote participants who may not have such additions. During the analysis phase, they mainly used the video capture and VR play of the participant; there was no discussion of how quaternion data were visualized or used during analysis. Moreover, the video capture may restrict gesture viewing to one angle (similar to the initial GES design with single cameras), which might cause difficulties during the gesture binning process. Their design follows the manual WoZ technique, which requires the simultaneous participation of the experimenter and the participant, and hence is not unsupervised. However, they allow enabling multiple human wizards at the same time, which could be a potential way to overcome the concern of executing studies in series; still, as in the initial GES design, parallel elicitation remains strenuous. Perera et al.'s (2021) VR-GES design captures only real-time quaternion data of participants' gestures and provides a hand-remodeling mechanism for the binning process; no VR play or video of participants' hands was recorded. In a follow-up study, VR-GES discusses embedding legacy bias mitigation techniques (Perera et al. 2023) as well. Both the Welicit and VR-GES empirical designs have the capability to embed custom environments as required by the GES.
While both these empirical designs enable the use of VR for GESs, in this study we aimed to avoid capturing videos of participants for privacy reasons, such as exposing their home/local environments. We were interested in extracting only privacy-preserving and lightweight data. Therefore, we decided to use Perera et al.'s VR-GES for this comparison. Further, VR-GES (Perera et al. 2021) was developed as a native, modularized VR application, so we could easily customize the elicitation environment without affecting the elicitation algorithm, and no additional equipment was needed as it uses native hand-tracking capabilities. It captures only the hand skeleton data, which raises fewer privacy concerns than recording participant video or VR play, and the generated data are lightweight. As highlighted in Saffo et al.'s (2020) VR study, it is better to limit the amount of data collected when using online tools or platforms, as the service providers of these platforms may review the content; this is important when distributing a VR study among participants. VR-GES also allows us to remodel the hand gesture data and view it from any angle to aid the classification process, which was an added advantage. Thus, in our GES empirical comparison, we used VR-GES together with the in-lab setup that we developed. We discuss this further in Sect. 3.
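To illustrate the kind of lightweight, privacy-preserving record this implies (the field and joint names here are our own illustrative assumptions, not the actual VR-GES data format), a single hand-tracking frame could be serialized as a small JSON object of per-joint quaternions rather than video:

```python
import json
import time

def serialize_frame(joint_rotations, timestamp=None):
    """Serialize one hand-skeleton frame as lightweight JSON.

    `joint_rotations` maps joint names (e.g. 'wrist', 'index_tip')
    to (x, y, z, w) quaternions. Only the pose is stored: no camera
    imagery, so the participant's surroundings are never captured.
    """
    return json.dumps({
        "t": timestamp if timestamp is not None else time.time(),
        "joints": {name: list(q) for name, q in joint_rotations.items()},
    })
```

A frame serialized this way is a few hundred bytes at most, which is consistent with the lightweight, remodelable skeleton data described above.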
Overall, while these empirical designs address the major concerns of the initial GES design, to date there are no studies that compare these new designs with the initial design, which is still widely used by HCI practitioners and designers. Thus, it is important to investigate how these new empirical designs that use novel technologies such as VR affect the results compared to established in-lab GESs.
2.4 Comparison of empirical methods
There is a limited number of studies comparing GES designs in the literature; therefore, we present a table (Table 1) that provides a general overview of various GES designs, extended from Koutsabasis et al.'s (2019) review. While specific comparisons of GESs are scarce, several researchers have compared the effects and results of conducting in-lab studies vs. online surveys, or in-lab vs. in-situ studies. Dandurand et al.'s (2008) comparison of online and lab methods for a problem-solving experiment, Clifford et al.'s (2014) data quality comparison, and Voit et al.'s (2019) prototype comparison are some examples. Thus, we review such studies with the purpose of incorporating the techniques and avenues they have explored into our comparison of GES empirical designs.
During their comparisons, these studies focus on participant experience and the usability aspects of the experiment design. Dandurand et al.'s (2008) comparison found that in-lab participants may be more committed to and engaged with the experiment, due to the investigator's continuous supervision, and hence more accurate than remote, unsupervised survey participants, who had higher dropout rates; they found that distractions experienced by remote participants could be a potential reason for this. In addition, Germine et al.'s (2012) investigation of the data quality of in-lab and web-based experiments and Andreasen et al.'s (2007) empirical comparison of remote and in-lab usability testing are examples that focus on how ecological validity works in remote experiments, in addition to usability and participant experience. Furthermore, Sun and May's (2013) work shows that participant experience ratings in their field-based study were higher than in the lab because participants were affected by the positive ambiance of a sports stadium. This indicates that the environment of an empirical study can affect the participant experience and hence the study results, and is thus important to consider in a comparison (Sharp et al. 2007; Sun and May 2013). In addition to Sun and May's (2013) work, which uses a real context (a sports stadium), other studies (Kjeldskov et al. 2004; Duh et al. 2006; Nielsen et al. 2006) compare the effects of the experiment environment and hence informed our comparison study. Thus, in our comparison, we compare the participant experience of the two empirical study designs, along with participants' immersion in the artificially constructed environments (both in VR and in the lab), in addition to the GES results. Moreover, many study comparisons show that results can differ based on the level of realism in the setup, the performed tasks, and the usability of the study design.
Previous work has also focused on the strengths and challenges of various types of study designs. In particular, Gustarini et al.'s (2013) work highlights the importance of looking at aspects such as recruiting bias, remunerating participants, participant cheating/negligence (in supervised and unsupervised conditions), the type of data that can be collected, where and how data can be stored, and participants' motivation and privacy needs. Henceforth, our comparison is designed to capture these aspects as much as possible.
In summary, GESs use different empirical designs to overcome the concerns inherited from their initial study design. Yet, these new designs also entail their own unique advantages and disadvantages, which may affect the final results: the elicited gesture proposals. While there is no comparison of how different GES empirical designs might affect their results, previous work has investigated how similar or different empirical designs (of other empirical studies) affect their results and shows which aspects are important to examine when comparing experiments. Importantly, in addition to the final results generated by an empirical design, it is important to compare participant experience, which is often missed when developing new empirical designs, task distribution, the usability of the empirical design, the experiment environment, and participant behavior (Dennis 2014; Field and Hole 2002; Gustarini et al. 2013). Especially with the recent trend of using VR and AR as proxies for HCI studies (particularly GESs), it is important to know how these methods and designs affect the results compared to established and popular methods such as in-lab studies.
3 Experiment
The purpose of this study is to compare and validate the potential of conducting remote and encapsulated GESs using VR outside the laboratory. For this purpose, we conducted two GESs: one following the established in-lab empirical design and the other an out-of-lab GES conducted using VR. Both studies were set up following the design guidelines provided in previous work so as to be representative of standard GESs. In our investigation, we compare the usability of both empirical designs, the participant experiences, and the tasks participants performed, together with a comparison of the proposed gestures and their agreement. In this section, we discuss the study design and procedures.
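The gesture agreement comparison operates on the binned proposals for each referent. As a concrete illustration (a sketch of the widely used agreement rate of Vatavu and Wobbrock (2015), not necessarily the exact implementation used in our analysis), the per-referent agreement can be computed as follows:

```python
from collections import Counter

def agreement_rate(proposals):
    """Agreement rate AR(r) for one referent (Vatavu and Wobbrock, 2015).

    `proposals` is a list of gesture labels for a single referent,
    one label per participant, as produced by the binning step.
    AR is 0 when every proposal differs and 1 when all participants agree.
    """
    n = len(proposals)
    if n < 2:
        return 0.0
    # Sum of squared relative sizes of the groups of identical proposals.
    s = sum((count / n) ** 2 for count in Counter(proposals).values())
    # Rescale so that chance-level disagreement maps to 0.
    return (n / (n - 1)) * s - 1 / (n - 1)
```

For example, four participants split evenly between two gestures yield AR = 1/3, reflecting moderate agreement, while unanimous proposals yield AR = 1.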
3.1 Study design
The GESs we conducted aimed to elicit symbols to control the smart home devices listed in Table 2. Interactions to control smart home devices are a widely discussed topic in HCI, especially with the proliferation of concepts such as ambient assisted living (AAL), the Internet of Things (IoT), and edge computing. Yet, studying smart home interaction in an ecologically valid setup can be logistically challenging in a lab setting, as it requires real devices, environment setups, control systems, maintenance of the space and devices, etc. As Magrofuoco et al. (2019) highlight, "conducting a GES represented a significant effort for identifying user-defined gestures", and this additional work may increase that effort further. For instance, GESs that followed the initial empirical design, such as Vogiatzidakis et al.'s (2019, 2022) and Hoffmann et al.'s (2019), have either asked users to imagine the real setup and/or used symbolic inputs such as paper posters, pictures, or slides for the list of devices or equipment. In addition, remote GES designs such as Gelicit (Magrofuoco and Vanderdonckt 2019) or Crowdlicit (Ali et al. 2019) might also fall short in catering to this use case. Another motivation for selecting this use case was that the VR-GES empirical design originally proposed by Perera et al. (2021) also focuses on the smart home environment. Thus, we chose this use case to observe the optimum benefits of VR-GES compared to in-lab (IL) GES.
Our experiment consists of two GESs that use two different empirical designs for the purpose of comparison. Thus, we used a basic design (Lazar et al. 2017) with a single independent variable: empirical design. Empirical design is a between-subjects factor with two levels: in-lab (IL)-GES vs. VR-GES. For legacy bias reduction, we used priming in both designs. Each participant was subject to only one empirical design, which made all the samples independent. Each participant group was presented with the same 21 referents spanning 5 devices, as listed in Table 2 below. While we used the Latin Square algorithm to balance the order in which referents were presented to each participant, certain referents had to be presented after/before others. For example, as we started the experiment with all devices turned off, a 'turn off' referent had to be presented to a participant after the relevant 'turn on' referent of that device. An alternative design option would have been to keep certain devices running (turned on) when a participant entered the experiment setup, so that they could perform the 'turn off' referent before 'turn on'; yet this would still have needed considerable counterbalancing. Therefore, for simplicity, we kept all the devices off at the beginning of the study. For both elicitation studies we maintained a similar experimental setup when eliciting symbols from participants. Thus, when developing the IL-GES and VR-GES we planned the same environment setup as shown in Fig. 2a. The real-world implementation of the setup and its digital duplicate (also referred to as a digital model (Fuller et al. 2020)) created in VR is shown in Fig. 2b. Each GES design is explained below.
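As a minimal sketch of this constrained ordering (the referent labels and on/off pairs below are illustrative placeholders, not the actual Table 2 referents), a balanced Latin square row can be generated per participant and then repaired so that every 'turn off' follows its 'turn on'; this repair pass is exactly what costs the design perfect counterbalancing:

```python
# Hypothetical referent labels; the study's 21 real referents are in Table 2.
REFERENTS = ["LAMP_ON", "LAMP_OFF", "FAN_ON", "FAN_OFF", "TV_ON", "TV_OFF"]
PAIRS = {"LAMP_OFF": "LAMP_ON", "FAN_OFF": "FAN_ON", "TV_OFF": "TV_ON"}

def balanced_latin_square(n, row):
    """One row of a standard balanced Latin square (1, 2, n, 3, n-1, ...)."""
    seq, j, h = [], 0, 0
    for i in range(n):
        if i < 2 or i % 2 != 0:
            val = j
            j += 1
        else:
            val = n - h - 1
            h += 1
        seq.append((val + row) % n)
    if row % 2 == 1:
        seq.reverse()
    return seq

def order_for_participant(row):
    order = [REFERENTS[i] for i in balanced_latin_square(len(REFERENTS), row)]
    # Repair pass: every 'turn off' referent must follow its 'turn on'
    # referent, since all devices start powered off.
    for off, on in PAIRS.items():
        i_off, i_on = order.index(off), order.index(on)
        if i_off < i_on:
            order[i_off], order[i_on] = order[i_on], order[i_off]
    return order
```

The swap-based repair only exchanges the two members of a violated pair, so the remaining referent positions keep whatever balance the Latin square provided.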
3.1.1 In lab GES (IL-GES)
As shown in Fig. 2b, we set up the lab with real devices that could be controlled externally during the elicitation study, to create the illusion of system autonomy (Henschke 2020). We used the WoZ technique, where the investigator plays the role of the wizard and the participant and the investigator are involved simultaneously. The studies were conducted in series, with two investigators (one after another) in the IL-GES to reduce the time taken. To keep the instruction presentations consistent across all participants, we used a semi-automated system to control the referents, as shown in Fig. 3, which triggers the same voice instructions for every participant. Another reason to use this tool was that during our pilot study we observed that some participants had difficulty understanding the instructions due to the investigators' accents and slightly varied wording. With the tool, we reduced the speaking speed and kept the instructions identical across all participants. The same instruction recordings were used in the VR-GES as well.
Once the investigator starts the study, this tool highlights the referent buttons that need to be clicked next. At the end of the study, the investigator generates a text file containing the order of the presented referents and the experiment time. This was then used to match the video recordings from the 6 cameras in the lab that record participants' gesture proposals. The instructions provided in the IL-GES are the same as in the VR-GES, and the investigators provided no additional hints or clues to the participants, for consistency. For the gesture recordings, as shown in Fig. 2b, we used a traditional video-based recording mechanism with 6 cameras placed at 6 different angles to reduce occlusions (e.g., body parts covering the camera's view when participants performed a gesture). The cameras also had to be placed somewhat far away to capture participants' mid-air hand movements without them leaving the camera frame. A single instance of a participant performing a gesture to turn on the table lamp, captured from the 6 camera angles, is shown in Fig. 4.
Once the participant had given their informed consent to participate in the study, they completed the pre-questionnaire, which covered basic demographics and their previous experience with smart devices. Following that, the participant entered the setup and was seated on the chair provided (see Fig. 2a). Participants were seated during the whole study and had the option to stop the experiment at any time and walk out of the lab if they did not want to continue. Once seated, the participant was shown a video of a virtual character interacting with several devices (not used in the study) using hand gestures. This design was used by Morris et al. (2014) and Ali et al. (2021) to prime participants and reduce potential legacy-biased proposals. The investigator then started the trial, which helped the participant understand the task they had to perform during the real experiment. The investigator could monitor the participant via the live camera mounted on top of the TV, directly in front of the participant's seating position. Once the trial was completed, the investigator started the real study. After a referent is played, the participant presses the green button in front of them to indicate the start of the gesture and performs the required symbol (gesture) for the given referent. The participant can press the red button if they are not happy with the proposed symbol, and the referent is presented again. Once the investigator sees that the participant has finished the gesture, they initiate the next referent using the IL-GES controller. Once all the referents were finished, participants filled out a post-experiment questionnaire and provided their qualitative feedback on the experiment.
3.1.2 VR GES
For the VR GES, we developed a digital duplicate of the IL-GES lab setup, similar in size, device placement, and appearance to the real setup. We also used virtual models of the real devices, which preserved the ecological validity. Further, this avoided the additional measures we had to take in the IL-GES setup to reduce safety concerns arising from the physical setup and devices. For example, we had to use cord covers on the floor in the physical setup (see Fig. 2b) to avoid participants accidentally stepping on wires and to make the in-lab setup less cluttered. In the VR-GES, for the referent presentation process, we used the same voice recordings as in the IL-GES, and the Latin Square was likewise used to decide the referent presentation order. Further, we used an automated WoZ technique in the VR GES: in remote and unsupervised settings, the VR application automatically plays the next referent and activates the referent once the participant has proposed a symbol. Thus, the VR application plays the role of the wizard instead of the investigator. In contrast to the IL-GES, in the VR-GES we collected only hand skeleton data of participants and used Perera et al.'s (2021) 3D hand reconstruction mechanism to visualize the performed hand gestures during the binning process. After 3D reconstruction, hand gestures can be viewed from any angle (see Fig. 6), which was easier than analyzing captured video footage from 6 different camera angles. Unlike other empirical designs, the VR GES did not capture the VR play or real video footage of the participant or their personal environment. The required hand skeleton data was extracted in real time as the gestures were performed, so no video data was required. This generates a lightweight data file per participant, compared with the IL-GES, which included footage from 6 camera angles for each participant.
Similar to the IL-GES, once the participant had given their informed consent via an online form, they completed the same pre-questionnaire as in the IL-GES. Since the VR GES was conducted remotely, participants either borrowed a VR HMD (an Oculus Quest 1 or 2 device) or used their personal VR HMD, and completed the study in their own time. The study procedure was similar to the IL-GES: the VR application first plays the priming scenario, then guides the participants through the trial, and upon completion of the trial session, the real study begins. After a referent is played, participants press the green button in front of them to indicate the start of the gesture and perform the required symbol (gesture) for the given referent. Unlike in the IL-GES, in the VR GES, once participants press the green button, the virtual hands turn red and a red dot with the letters 'REC' appears in their field of vision (FoV). This procedure was followed in the original VR GES empirical design of Perera et al. (2021) and received positive ratings from the experiment participants; hence we followed the same procedure without modification. An image capture sequence of the VR GES is shown in Fig. 5. From left to right it shows: participants entering their unique identifier, the participant's view at the beginning of the study, a participant performing a gesture to turn on the TV, and the TV responding afterward. Similar to the IL-GES, participants can press the red button if they are not happy with the proposed symbol, and the referent is presented again. Finally, participants remove the HMD, fill out a post-experiment questionnaire, and provide their qualitative feedback on the experiment.
3.2 Participants
Participants were recruited by publishing the study through our internal email list, social networks, and word of mouth. We recruited 35 volunteers between the ages of 21 and 62 (M = 33.7, SD = 10.8) for the IL-GES, comprising 23 male and 12 female participants. For the VR-GES, we had 31 participants between the ages of 21 and 58 (M = 32.2, SD = 8.9), comprising 20 male and 11 female participants. We had to drop 6 participants' data (5 male and 1 female) based on the criteria described in Sect. 3.5 and for the purpose of counterbalancing the two studies. All participants reported a very good level of experience interacting with home/office appliances; of the 60 participants considered, 73.3% had some VR experience, 86.7% were right-handed, and there were no ambidextrous participants.
3.3 Apparatus and setup
For the IL-GES, we built the physical room with real devices for the purpose of this study and obtained permission to maintain the space with the setup for three months. As shown in Fig. 2, for the table lamp we purchased an off-the-shelf lamp with a WiFi-controlled smart bulb; for the blade-less fan, we used a Dyson Hot and Cool Purifier;Footnote 2 for the security camera, we used a TP-Link WiFi-controlled general-purpose camera; and for the window blinds, we used a WiFi-controlled Smart Blinds Driver and attached the blind pull cord to it. For the TV, we used the inbuilt screen inside the room space, used Webex video conferencing, and shared the screen. We developed a custom PowerPoint presentation replicating an actual smart TV interface with a home screen and several channels by embedding pre-downloaded videos. All the devices were connected to a single router placed within the room below the participant's chair, as marked in Fig. 2a. For the video recording, we used 4 GoPro Hero-9 cameras: one mounted on a tripod (camera 2 in Fig. 2a), two mounted on the wall, and one on the ceiling. In addition, two iPhone 12 cameras (cameras 1 and 3 in Fig. 2a) were mounted on two tripods. All the cameras had to be manually turned on before a participant entered the setup and turned off after the participant finished the study, and the footage had to be backed up and the cameras charged after each participant.
The VR-GES was developed using the Unity Game Engine,Footnote 3 extending Perera et al.'s (2021) VR-GES design. We took measurements of the real room, modeled it in Unity using ProBuilder,Footnote 4 and created the digital duplicate of the real setup. The table, the Dyson fan, and the TV were modeled using Blender,Footnote 5 an open-source 3D creation suite. The chair and security cameras were downloaded from Sketchfab.Footnote 6 After development, the VR application was tested on Oculus Quest 1 and 2 hardware. The recorded hand data was saved into a JSON file consisting of the quaternion data of the absolute orientation of the hand movements. Once a participant completed the study, the JSON file was created and securely transferred to university servers over HTTPS at the end of the experiment. If a participant stopped the experiment midway, no data was recorded or transferred. Participants were given instructions on how to delete the hand skeleton data or to uninstall the VR application completely from their VR HMDs.
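As a rough sketch of such a per-participant file, assuming a hypothetical schema (the authors' actual joint set and field names are not specified): each frame stores a timestamp and one (x, y, z, w) quaternion per tracked joint, which keeps the upload small relative to multi-camera video.

```python
import json

def make_frame(joints, t):
    """joints: mapping of joint name -> (x, y, z, w) orientation quaternion.
    Joint names and structure here are illustrative, not the authors' schema."""
    return {"t": t, "joints": {name: list(q) for name, q in joints.items()}}

# Two illustrative frames for a single participant.
frames = [
    make_frame({"wrist": (0.0, 0.0, 0.0, 1.0)}, t=0.000),
    make_frame({"wrist": (0.0, 0.1, 0.0, 0.995)}, t=0.033),
]

record = {"participant": "P01", "frames": frames}
payload = json.dumps(record)      # serialized once; uploaded over HTTPS
restored = json.loads(payload)    # analysis side: feed into 3D hand reconstruction
```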
3.4 Measures
Previous work shows that different empirical designs can affect usability and the perceived participant experience, which could influence the results of an experiment. Thus, in our GES empirical design comparison, we investigated each design's usability, the participant experience it brings, the task load difference, and finally the elicited gesture proposals. To study these elements of each design, we collected quantitative and qualitative feedback from participants using a set of standard questionnaires across the two study designs. In both the IL-GES and VR-GES studies, participants answered the questionnaires in similar settings; i.e., VR participants answered the questionnaires without wearing the headset, using a 2D computer screen.
We employed the AttrakDiff questionnaire (Hassenzahl et al. 2003) for the participant experience measurement, a questionnaire often used in HCI to investigate the attractiveness of a system by assessing its pragmatic and hedonic qualities. As previous studies highlight, it is important to consider participants' comfort when developing a new empirical design, to avoid fatigue and frustration that might also affect the experiment results. To study the usability of each design, we used the System Usability Scale (Brooke et al. 1996), a widely used standardized questionnaire for assessing usability. Even though both empirical designs aim to elicit gestures in a smart room environment, each design may give a different perspective on the tasks participants performed. Thus, we used the raw NASA TLX (Hart 2006; Hart and Staveland 1988) to compare whether there are any task load differences that participants may encounter between the two methods. NASA TLX is frequently used to assess the physical, mental, and emotional demands of a task.
Since illusion plays a major role in a GES that uses the WoZ technique (Henschke 2020), to investigate to what extent participants believe in the autonomy of the system (i.e., whether participants think that the devices responded to their hand gestures), and the ecological validity, we used a seven-point Likert scale inspired by Banakou et al.'s (2013) body ownership illusion task. Body ownership illusions are illusions in which participants perceive virtual bodies to be their own (Banakou et al. 2013; Kilteni et al. 2015). It is important to note that both participant groups were in similar setups, yet both were artificial environments created in different ways.
Furthermore, for qualitative analysis, we used several open questions allowing descriptive answers to investigate participants' preferences (what they liked and disliked), their experience, whether they would behave similarly in a real scenario, and their general thoughts and concerns. In addition to the questionnaires, we measured the task completion time (TCT) of the primary task and of answering the questionnaires. Finally, gesture agreement analysis was conducted using the hand skeleton data of participants' virtual hands in the VR-GES and the coded video data captured in the IL-GES. The agreement rates and the proposed symbols from both GES designs were then further compared and analyzed.
3.5 Data validation
As with other empirical studies, data validation is crucial for GESs to ensure the quality and accuracy of the data collected from participants. Especially in a remote and encapsulated setting, attention and compliant participation are key because of the absence of a human evaluator during the study (Steed et al. 2016; Kittur et al. 2008). Validating the data helps researchers and designers identify and eliminate errors or anomalies that may occur during or after the elicitation, which could otherwise affect the validity and reliability of the research findings. Accordingly, we defined the following exclusion criteria for this study.
-
Participants' understanding of the task: We first observed the trial data of both the VR-GES and the IL-GES to investigate the participants' understanding of the study. This gave us an overview of the data before spending time analyzing the complete video or skeleton data from a particular participant.
-
Completeness and missing data: This is to make sure participants proposed gestures for all the referents and that all required questionnaires were filled out completely and accurately. For example, in the IL-GES there were instances where one or two cameras froze or stopped recording, leaving only some angles of data. In the VR-GES, there were cases where the skeleton data was missing so many frames that reconstruction of the 3D hand model was impossible.
-
Invalid inputs/outliers: This includes identifying any data points that differ significantly from the rest of the data set (outliers), which can indicate errors or anomalies in the data collection process. For instance, if the same gesture was proposed for all the referents, or many gestures were unrecognizable, we treated that participant's data as invalid/an outlier. Further, we observed some questionnaires with extremely low completion times, which we considered a sign of lack of attention and compliance from the participant.
Following this validation process, five participants' data were discarded from the IL-GES. One participant completed the post-questionnaire in an extremely short time with zero variance in the Likert scale questions. Another participant had difficulty comprehending the notion of mid-air gestures and started to interact with the devices by physically reaching out to them, despite the explicit instruction, given to all participants, to remain seated throughout the experiment. Among the remaining three, one did not suggest gestures for two referents, and we encountered difficulties with gesture recording due to at least one camera failure for the other two. For the VR-GES, we discarded one participant's data due to an incomplete post-questionnaire.
4 Results
In this section, we present the similarities and differences we found between the two empirical methods in terms of the ratings of the standardized questionnaires, the average times for answering the questionnaires, and the qualitative feedback.
4.1 Participant experience
4.1.1 Comparison of task loads
Since the VR-GES brings a novel empirical design, it is important to investigate the participants' experience and comfort levels compared to the existing IL-GES design; it is best practice to evaluate the suitability of an experiment from the participants' perspective when proposing novel empirical designs. Therefore, we first investigated the task load that participants undergo in each of these designs, using the raw NASA TLX questionnaire (Hart 2006; Hart and Staveland 1988). This helps us determine whether there were any significant differences between the task loads of the two designs. With the null hypothesis that there is no difference between the overall task loads of the two designs, and after inspecting the overall task load rating distribution [Shapiro-Wilk w = 0.928, p <.001], we conducted a non-parametric Mann–Whitney U test to determine whether there is a difference in overall task load rating scores (see Fig. 7a) between the two designs. The results indicate a non-significant difference between the task loads of the two empirical designs [U = 159.18, p = 0.775]. We therefore fail to reject the null hypothesis and conclude that there is no difference in the overall task load ratings between the VR-GES and the IL-GES.
Following the overall task load comparison, we conducted a more nuanced analysis of each sub-category of the NASA Task Load Index (TLX) to compare the mental, physical, and temporal demands of the tasks, as well as participants' satisfaction with their performance, effort levels, and frustration. We used the Mann–Whitney U test to compare the task loads of the two designs for each sub-scale, with the null hypothesis that there were no differences in these scales between the designs. Our results showed statistically significant differences in mean ratings for the mental and physical demands of the tasks (U = 37.0, p <.001 and U = 253.0, p =.003, respectively). However, we found no significant differences between the designs for the temporal demand, performance satisfaction, and effort sub-scales (U = 364.0, p = 0.189; U = 426.0, p = 0.679; U = 366.0, p = 0.203, respectively). In terms of frustration, we found a significant difference between the two designs (U = 79.5, p <.001). Overall, we failed to reject the null hypothesis for the temporal demand, performance satisfaction, and effort sub-scales, concluding that there were no significant differences in these scales between the VR-GES and IL-GES designs. However, for the mental and physical demands and the frustration levels of the performed tasks, we rejected the null hypothesis and concluded that there were significant differences between the two designs.
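Such sub-scale comparisons can be reproduced with a standard Mann-Whitney U computation. The sketch below is pure Python with a normal approximation for the p-value, which is reasonable for group sizes around 30 but omits the tie correction in the variance; the data would be the two groups' ratings, not reproduced here.

```python
import math
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test: average ranks for ties, normal
    approximation for the p-value (suitable for n1, n2 >= ~20)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # Assign average ranks to tied values (ranks are 1-based).
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2           # U statistic for group x
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u1 - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p
```

In practice one would use `scipy.stats.mannwhitneyu`, which also offers an exact method for small samples; the hand-rolled version just makes the rank-sum mechanics explicit.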
4.1.2 Comparison of usability of each design
Next, we investigated whether the VR-GES design has similar usability to the original IL-GES empirical design. As Voit et al.'s (2019) work shows that VR has the closest resemblance to real-world scenarios, we started with the null hypothesis that both empirical designs have similar usability. We obtained the overall usability rating scores of the SUS questionnaires from the two participant groups (shown in Fig. 7c); both fall into the 'good' category according to the standard SUS analysis. We then calculated each participant's average usability rating and used it to investigate whether there were any significant differences between the usability of the two designs. With normally distributed rating scores of nearly equal variance (supported by the Shapiro-Wilk normality test, p = 0.06, and Levene's homogeneity of variances test, p = 0.54), a Welch two-sample t-test showed that the usability difference was statistically insignificant (t(57.8) = 0.277, p = 0.783, where t(57.8) denotes a Welch t-statistic with 57.8 degrees of freedom). This confirms the similarities that Voit et al. (2019) observed between VR and in-situ studies.
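The Welch t-statistic and its fractional degrees of freedom (e.g., the 57.8 reported above) come from the Welch-Satterthwaite formula; a minimal sketch follows. The p-value would then be obtained from a t distribution with df degrees of freedom (e.g., via `scipy.stats.t.sf`); the data shown in the docstring are assumptions, not the study's SUS scores.

```python
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)   # unbiased sample variances
    se2 = v1 / n1 + v2 / n2             # squared standard error of the mean difference
    t = (mean(x) - mean(y)) / se2 ** 0.5
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```

Note that df is generally not an integer, which is why Welch results are conventionally reported as, e.g., t(57.8).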
4.1.3 Comparison of the illusion and feel of autonomy in each design
In GESs, creating the illusion of autonomy is paramount to eliciting useful proposals from participants (Henschke 2020). Therefore, we investigated how successful each empirical design is in creating the illusion of autonomy, also known as "control", and "presence". Here, the term presence refers to the ability of a user to feel that they are in an actual location. As discussed in Sect. 2.4, both the VR-GES and the IL-GES used artificially created room setups aiming to provide the illusion of autonomy (control) and presence, to make participants believe they were in a smart setup and that the devices were responding to their hand gestures. We hypothesized that the VR-GES and IL-GES have similar control and presence ratings overall. To test our hypotheses, Mann–Whitney U tests were conducted to determine whether there is a statistically significant difference in participants' control, presence, and overall rating scores between the VR-GES and IL-GES (shown in Fig. 7b). The results indicated a non-significant difference between the overall ratings [U = 405, p = 0.502] and the presence ratings [U = 400, p = 0.455] of the two empirical designs. Yet, there is a significant difference between the control ratings [U = 309, p = 0.033] of the two designs. We therefore rejected the null hypothesis that the VR-GES and IL-GES designs have similar control ratings, but concluded that there is no difference in the overall and presence ratings between the VR-GES and IL-GES. We discuss this observation further in the discussion section.
4.1.4 Pragmatic and hedonic quality comparison
The tasks and instructions provided in a GES can be stereotypical (depending on the experiment design) and can cause participant fatigue, boredom, and reduced attention, which could affect the final results of the experiment. Thus, reducing fatigue and keeping participants engaged is an important factor when developing an experiment, especially if the elicitation study contains a large number of referents, which is typically the case when eliciting gestures for systems such as smart homes, vehicle infotainment systems, etc. Therefore, as part of participant experience, we compared the pragmatic and hedonic qualities of each empirical design. We evaluated Pragmatic Quality (PQ), Hedonic Quality Identity (HQI), Hedonic Quality Stimulation (HQS), and Attractiveness (ATT) using the AttrakDiff questionnaire (Hassenzahl et al. 2003). Again, we started with the null hypothesis that the ratings of these four attributes (PQ, HQI, HQS, and ATT) do not differ between the VR-GES and IL-GES (shown in Fig. 8a) and conducted Welch's t-tests. We assumed each rating had a normal distribution and that the sample variances were nearly equal, based on the Shapiro-Wilk normality test and Levene's homogeneity of variances test, respectively. Welch's t-test results for the Pragmatic Quality (PQ) ratings show no statistically significant difference between the two designs (t(57.8) = −1.25, p = 0.216); thus we concluded that both designs have similar pragmatic quality ratings. This resonates with our SUS findings, as PQ aims to measure the usability aspects of a design. However, for the other three attributes we found statistically significant differences between the two designs.
For Hedonic Quality Stimulation (HQS), we found a significant difference between the ratings (t(51.5) = 9.97, p <0.001); for Hedonic Quality Identity (HQI), we also found a significant difference (t(50.4) = 3.15, p = 0.003); and likewise for the ATT ratings (t(56.6) = 4.58, p <0.001). Thus, we concluded that while the VR-GES and IL-GES designs have similar pragmatic qualities, they differ in terms of hedonic stimulation, hedonic identification, and overall attractiveness. The portfolio graph (shown in Fig. 8b) shows that the IL-GES falls at the intersection of the task-oriented and desired quadrants, with scores similar to the VR-GES on the pragmatic dimension but different hedonic ratings. The VR-GES evaluation showed better hedonic quality than the IL-GES. This indicates that users perceive the VR-GES design as both functional and enjoyable to participate in, leading to a positive user experience. The high scores on the hedonic dimension suggest that the design also evokes positive emotions in users, making it highly desirable.
4.2 Gesture elicitation
Next, we looked into the elicited gesture proposals from each method. In theory, both GES designs should produce the same or similar gesture vocabularies, as the purpose of a GES is to identify the potential best gestures at the early stage of gestural interface/interaction design. This also evaluates the reproducibility of a GES, which is under-explored in the literature. We therefore hypothesized that the VR-GES and IL-GES produce the same gesture vocabulary, as the studies were conducted independently with two participant groups. It is important to note that we conducted this elicitation study in a 'setup', unlike an elicitation study that focuses on a single device (e.g., a GES for smart TV interactions only). In this work, we followed elicitation studies as discussed by Vogiatzidakis et al. (2019), where the study is conducted in a setup that contains multiple devices, similar to a real-world scenario; hence participant proposals can be affected by the environment. Vogiatzidakis et al. (2019) further highlight that such setups can evoke different needs and mental models in participants, who may reuse gestures for similar referents. We observed this phenomenon, seeing common best gestures across devices with similar referents.
After collecting the gesture proposals from the VR-GES and IL-GES, we conducted a gesture agreement analysis on the elicited gestures of each study separately, using the agreement rate AR shown in Eq. 1. This measure was introduced by Vatavu et al. (2015) to observe participant agreement on gesture proposals. In Eq. 1, P is the set of all gesture proposals received for a given referent r in a single study, and Pi are the subsets of equivalent proposals. The calculated agreement rates for all the referents are shown in Fig. 9. We then classified these referents as low, medium, high, and very high based on the AR interval magnitudes \(\le \) 0.100, 0.100 to 0.300, 0.300 to 0.500, and > 0.500, as proposed by Vatavu et al. (2015).
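The AR of Eq. 1 reduces to counting agreeing proposal pairs over all proposal pairs for a referent. A minimal sketch, with the classification brackets above (the gesture labels are illustrative):

```python
from collections import Counter

def agreement_rate(proposals):
    """Agreement rate AR(r) for one referent (Vatavu et al. 2015):
    the fraction of participant pairs that proposed equivalent gestures.
    `proposals` is a list of gesture labels, one per participant."""
    n = len(proposals)
    if n < 2:
        return 0.0
    agreeing_pairs = sum(c * (c - 1) for c in Counter(proposals).values())
    return agreeing_pairs / (n * (n - 1))

def classify(ar):
    """Interval magnitudes proposed by Vatavu et al. (2015)."""
    if ar <= 0.100:
        return "low"
    if ar <= 0.300:
        return "medium"
    if ar <= 0.500:
        return "high"
    return "very high"
```

For example, if 3 of 5 participants propose a swipe and 2 propose a tap, AR = (3·2 + 2·1)/(5·4) = 0.4, which falls in the "high" bracket; the overall ARs reported below (0.44 and 0.40) land in the same bracket.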
Overall, 21 out of the 22 referents carry similar best proposals in both studies, a 95.4% similarity. The only referent with two different best gesture proposals was BF_ON. However, this referent had a low agreement rate for its best gesture in both studies (VR-GES 0.189 and IL-GES 0.218). Interestingly, the second-best gesture from the IL-GES for this referent was the best gesture of the VR-GES, with a frequency ratio of 8:9 in the IL-GES (i.e., 8 proposals of gesture number 8 and 9 proposals of gesture number 7, as shown in Table 3).
Looking at the overall AR, both designs show agreement that falls in the same classification bracket (0.300 to 0.500) of Vatavu et al. (2015). The IL-GES shows the higher overall agreement rate of the two studies, with a value of 0.44, which is 8.8% higher than the VR-GES value of 0.40. However, it is important to note that AR can be affected by the number of unique gesture proposals received in a study while the total proposal count remains the same. Thus, we looked at the number of unique gestures produced by each design during the elicitation. As a result of legacy bias reduction, participants potentially propose novel gestures that are suitable for emerging technologies, so we were interested in the unique gesture proposal frequencies in each design. As shown in Fig. 10, the VR GES produced the highest number of unique gestures for 76.2% of the referents, while TV_CHNL40, WB_OPEN, and WB_CLOSE show an equal number with the IL-GES. However, the IL-GES recorded more unique gesture proposals than the VR-GES for the BF_OFF and BF_SWNOFF referents. Overall, we observed that the VR-GES produced a more diverse set of proposals than the IL-GES. Table 3 shows the final gesture vocabulary, with synthetic 3D hand models generated by the VR-GES analysis tool.
4.3 Task completion time (TCT) analysis
In order to gain further insights into these empirical designs, we conducted a detailed analysis of the task completion times (TCTs) of each design. We analyzed three time components of TCT: i) the total time taken for the experiment, ii) the time spent filling out all the questionnaires, and iii) the overall completion time of the GESs. We carried out separate comparisons of these three measures to identify any trends and ascertain whether there were statistically significant differences. Our null hypothesis posited that there would be no significant differences in task completion times between the VR-GES and the IL-GES for any of the three measured time values.
To investigate our hypothesis about TCTs, we conducted separate t-tests for each of the three time components. For the first time component, the assumption tests showed that both designs’ TCT data had normal distributions according to Shapiro-Wilk (W=0.973, p = 0.211) and unequal variances according to Levene’s test (F=11.2, p = 0.001). Therefore, we conducted a Welch’s t-test, which revealed a statistically significant difference in experiment completion times between VR-GES and IL-GES, with t(42.8) = 7.83, p < 0.001. In the second time component, the assumption tests showed that the TCT data for both groups met the assumptions of normal distribution according to Shapiro-Wilk (W=0.980, p = 0.426) and equal variance according to Levene’s test (F=1.26, p = 0.266). Therefore, we conducted a two-sample t-test, which again yielded a statistically significant difference in the time spent on filling out all the questionnaires between VR-GES and IL-GES, with t(58) = −2.10, p = 0.04. For the third time component, assumption tests revealed normal distributions and equal variances for the TCT data of both groups according to Shapiro-Wilk (W=0.985, p = 0.659) and Levene’s test (F=2.98, p = 0.089). Thus, we conducted a two-sample t-test which showed a significant difference in the overall completion time of the GESs between VR-GES and IL-GES, with t(58) = 4.35, p < 0.001. Therefore, based on these results, we rejected the null hypothesis that there would be no significant differences in TCT between VR-GES and IL-GES under any of the three time components.
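The test-selection logic described above (Shapiro-Wilk normality check, Levene's variance check, then a two-sample or Welch's t-test) can be sketched as follows using SciPy. The completion-time arrays are randomly generated stand-ins, not the study's data.

```python
# Sketch of the TCT comparison pipeline; the arrays below are
# hypothetical completion times in minutes, not the reported data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
tct_vr = rng.normal(loc=18.0, scale=2.0, size=33)   # VR-GES (hypothetical)
tct_il = rng.normal(loc=24.0, scale=4.5, size=33)   # IL-GES (hypothetical)

# 1) Normality of each group (Shapiro-Wilk).
normal = all(stats.shapiro(g).pvalue > 0.05 for g in (tct_vr, tct_il))

# 2) Homogeneity of variance (Levene's test).
equal_var = stats.levene(tct_vr, tct_il).pvalue > 0.05

# 3) Standard two-sample t-test if variances are equal,
#    Welch's t-test (equal_var=False) otherwise.
t, p = stats.ttest_ind(tct_vr, tct_il, equal_var=bool(equal_var))
```

With unequal group variances, `equal_var=False` makes `ttest_ind` compute Welch's t-test with its adjusted degrees of freedom, matching the procedure reported for the first time component.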
Further, we plotted the time taken by each participant for these three time components to identify trends in the VR-GES and IL-GES empirical designs in terms of study completion times, as shown in Fig. 11. Based on linear regression analysis of all three conditions, we observed that IL-GES has a negative gradient in experiment completion and overall GES completion times. This indicates that as the study continued, the time spent on conducting the experiment decreased, possibly because the investigator became more familiar with the procedure, or due to fatigue. The steeper slope of the line shows a larger effect on the total experiment time in IL-GES compared to VR-GES. Notably, the VR-GES trend line slope is close to zero in all three conditions, indicating very little effect on the completion time of each participant: a greater number of participants deviated only slightly from the mean completion time.
This finding is a potential indication of consistency in the experimental design and procedures, and suggests that participants had similar levels of engagement and motivation across the study. Therefore, the effects of different experimental factors could be assessed more accurately, without the influence of extraneous variables.
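As a sketch, the trend-line check above reduces to fitting a line over participant run order and inspecting its slope. The times below are synthetic, with a built-in negative trend mimicking the IL-GES pattern; they are not the study's data.

```python
# Regress completion time on participant run order; a negative slope
# suggests the study sped up over time (e.g., investigator familiarity),
# while a slope near zero indicates consistent completion times.
import numpy as np
from scipy import stats

order = np.arange(1, 31)  # participant run order (hypothetical, 30 sessions)
tct = 25.0 - 0.15 * order + np.random.default_rng(1).normal(0, 0.5, 30)

fit = stats.linregress(order, tct)
speeding_up = fit.slope < 0  # True for this synthetic IL-GES-like trend
```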
4.4 Qualitative analysis
The qualitative analysis focuses on understanding the effects of each design and the quality of participant feedback. The qualitative questions sought feedback from participants on their perspective and experience with the experiment, any questions they had, suggestions, and realism (i.e., whether they would behave in a real scenario/situation as they behaved in the simulated lab or VR environment). Participants were given the opportunity to enter long answers to these questions without any restrictions.
The analysis involved two researchers and two rounds. It began with thematic analysis using emergent coding (also known as open coding) to identify subsets and categories in the participant feedback. The two researchers then compared, discussed, and agreed on a consolidated list of categories. In the second round, the authors continued the analysis using axial coding to derive themes capturing participants’ perspectives and experiences of the two empirical designs and the level of illusion in each. It is also important to note that several subcategories, such as privacy, stereotypical/repetitive instructions, accessibility, and motivation/boredom, are in-vivo codes extracted directly from participant feedback. Through the analysis, we identified four main themes (Gestures; Environment; Illusion and Immersion; and Concerns) and twenty-nine subcategories (not reported) across the two designs. Below we highlight several major outcomes based on the main themes.
Under the theme “gestures", we categorized feedback that explains participants’ thoughts and perspectives on their performed gestures. In VR-GES, many participants commented on the virtual hand and related experiences, including the virtual hands’ color-changing mechanism, the representation of their real hands, thoughts about the speed of the performed gesture, and the lack of haptic feedback. For instance, several participants mentioned that the “Color change of virtual hand was helpful" for them to know when to “perform and stop the gesture" and that the “virtual hands mimic" their “real hands very well". Another participant added a suggestion: “Add an audio input parallel to hand color change". In IL-GES, the feedback categorized under “gestures" mostly related to the performed gestures themselves. For instance, several participants mentioned that they were “not sure where to perform the gesture as there were multiple cameras", “I wasn’t sure how long I could take for a gesture", and raised questions such as “how did a device know when I ended the gesture". Overall, this shows that participants paid attention to the experimental procedures, which could have influenced their gesture proposals.
Then we identified feedback that reflects participants’ experiences with the “environment" they were in. This aims to investigate the impact of virtual and physical environments on participants’ engagement and attention. In the VR-GES, participants provided feedback regarding the lighting and the objects in the virtual environment, which affected their engagement. For instance, one participant noted that “the lighting of the room is a bit too dark," while another participant reported that “the table lamp does not light up the room as expected." Additionally, one participant reported their hand “goes through the table" in the virtual environment, indicating the limitations of the virtual environment. These observations suggest that participants were attentive to the environment and that it influenced their engagement during the experiment. Further, in the IL-GES, participants provided feedback regarding the camera setup, which affected their attention. Several participants found the tripods to be distracting, leading them to “look at the camera all the time," while some participants suggested that the cameras should have been "wall mounted or hidden" to minimize their visibility. Additionally, participants reported that the “setup looks real, but the cameras and wiring sometimes give the feeling it is artificial." These reflections also suggest that participants were attentive to the environment and that the camera setup influenced their attention during the IL-GES experiment. Overall, the feedback provided by participants under the theme “environment" indicates that both virtual and physical environments have an impact on participants’ engagement and attention. Therefore, it is crucial to carefully construct and design the environment to ensure that it facilitates optimal engagement and attention.
The feedback provided by participants also offers insight into the areas of “illusion and immersion," which we have grouped together. One aspect that participants commented on was the directional audio effects we added to devices such as the bladeless fan, with one participant noting that the “fan produces real sound and heard as it coming from it." Similarly, feedback on the TV, such as “I really felt as I am watching TV," demonstrated the level of immersion experienced by participants. Despite explicitly informing participants that the VR application was an experiment to collect hand gesture data, some participants in the VR-GES commented that the study “felt like a game" to them. For example, participants noted that “it is more like a VR game. I complete all the tasks," and “I think all the devices responded to my gestures." These comments suggest that the VR environment immersed participants and gave them a sense of control, even though they were aware of the experimental nature of the study. Similarly, in the IL-GES, participants noted that they were curious about how to convey numbers such as 40 to a device and that they “felt like [they] are talking to devices." They also expressed a desire to have similar control over devices in their own homes, with comments such as “I wish I could control devices in my room like that" and “I wish my room’s window blind open as I raise my hand." However, some participants also seemed to recognize the WoZ method, noting that there was always a small delay when the device responded, and that “I think experimenters control the devices, but it was done well." Some participants even mentioned that they believed the button press triggered the device, rather than their gestures. In contrast to the IL-GES, in the VR-GES, participants made no comments indicating that the illusion was compromised.
Instead, they commented that the study was like a “simulation for real-world smart home" and expressed a hope to see “gesture-controlled smart homes soon in reality." These findings demonstrate that both virtual and physical environments can create a sense of immersion and illusion and that participants’ engagement and attention can be influenced by these factors. Furthermore, virtual environments are particularly good at creating illusions of control. Therefore, it is important to carefully design and construct environments that enhance immersion and minimize potential disruptions to the illusion.
The final theme we identified pertained to participants’ “concerns" about the study design. Some of these comments were participants’ thoughts and suggestions for improving the study designs. Many participants expressed discomfort with being captured on camera from multiple angles in the IL-GES, with some questioning the privacy implications of such recording. For instance, one participant said, “recording from 6 angles sometimes made [them] uncomfortable", while another in VR-GES wondered, “will there always be cameras in the real world monitoring [their] gestures?". Some participants highlighted voice assistants (Amazon Echo, Siri, etc.) and their video-capturing capabilities (Amazon Echo Show, Google Nest Hub) and questioned potential privacy concerns with video recordings. In both VR-GES and IL-GES, participants commented on the monotonous and repetitive nature of the instructions, with some saying they knew what was coming next after several instructions. Furthermore, one participant in VR-GES raised an accessibility concern, specifically related to color blindness: they questioned how a color-blind person would recognize the color change mechanism in VR-GES and suggested incorporating audio feedback to supplement the visual cues. Finally, some participants commented that they were confused by directional instructions such as “Turn Right" and “Turn Left". These concerns and suggestions provide valuable insights into the study design and potential areas for improvement, which we discuss further in Sect. 5.
5 Discussion
This study compares two empirical designs for gesture elicitation studies: one conducted using virtual reality technology (VR-GES) and the other inside a laboratory setting (IL-GES). We analyzed and compared the elicited gestures, agreement scores, and participant experiences of the two designs to determine their similarities and differences when eliciting gesture interactions. The results help interaction designers and HCI practitioners decide whether to use VR as a proxy for conducting interactive gesture elicitation, overcoming the challenges of laboratory-based empirical design.
When evaluating and comparing different empirical designs of an HCI study, it is essential to understand the participants’ experiences in each design. This helps in assessing the effectiveness of novel designs compared to established ones. In our study, we utilized the NASA TLX questionnaire to compare the task load of the VR-GES and IL-GES designs. Our analysis revealed that the overall task load ratings were similar for both designs. However, a more nuanced analysis showed significant differences in mental demands, physical demands, and frustration between the two designs. The mental demands of the VR-GES study were higher than those of the IL-GES study, which we attribute to the novelty of the medium; previous studies have reported that the use of VR technology in experimental settings can be perceived as more mentally demanding (Pouliquen-Lardy et al. 2016; Holzwarth et al. 2021). We also observed that participants in the VR-GES design provided more feedback on hand gestures, the environment, and suggestions for improvements, indicating that the VR experience stimulated more mental activity during and after the study. We were surprised to find differences in physical demands between the two designs, as both groups proposed gestures for a similar number of referents. However, we hypothesize that the engagement differences observed in the qualitative analysis may have contributed. Several participants in the VR-GES study reported feeling a sense of playing a game, which could have increased enjoyment and hence reduced the perceived physical demands of the study. This finding is consistent with previous studies that have reported higher levels of engagement in VR environments (Lu et al. 2018). Furthermore, this analysis also revealed significant differences in frustration levels between the two designs, with participants in the VR-GES study reporting less frustration than those in the IL-GES study.
This finding is consistent with the results of the task completion time (TCT) analysis, which showed that the VR-GES study had a lower experiment completion time. The reduced time spent on the task may have resulted in less fatigue and frustration among participants, and the novelty of the VR experience may also have reduced boredom, fatigue, and discomfort. In addition, the analysis of the attractiveness and hedonic qualities of each design revealed that VR-GES had better hedonic qualities and higher attractiveness than IL-GES, which could further explain the lower frustration ratings in VR-GES.
Another interesting outcome is the ability of VR-GES to create a better illusion of autonomy for gesture elicitation. In a GES, creating an illusion of system autonomy and presence is critical to the success of the study (Henschke 2020). Our results indicate that VR-GES was more successful in creating an illusion of system autonomy than IL-GES. This finding is supported by participants’ qualitative feedback mentioning that their gestures were being recognized by the VR application, and is in line with previous studies showing that VR is effective in creating an illusion of system autonomy due to its ability to provide realistic sensory feedback (Witmer and Singer 1998; Slater et al. 2009). However, our analysis also revealed that IL-GES had a slightly higher rating in terms of presence, although this difference was not statistically significant. This is likely because the physical lab was converted to a smart room with actual devices, making the setup closer to an in-situ study. Overall, the analysis shows that VR-GES is an excellent alternative for situations where the actual setup or system is unavailable or difficult to prototype, especially when designing gesture interactions for future technologies. VR-GES thus provides the opportunity for designers and researchers to create a more engaging and realistic experience for participants, leading to more accurate and reliable data in the early design stages of gesture-controlled systems.
The TCT results show that VR-GES exhibits consistent completion times across all participants, while IL-GES showed variations and was susceptible to systematic errors arising from investigator mistakes. Specifically, during IL-GES, we observed that the investigator spent more time with the first set of participants compared to later participants. Furthermore, the number of participants that could be accommodated per day was limited, and considerable time was required between two participants to back up the video recordings and charge the cameras for the next study, making the study process more time-consuming. VR-GES alleviates these concerns, allowing designers and researchers to focus on data and analysis rather than pragmatic issues.
Following the participant experience evaluation, we compared the agreement and elicited gesture sets of the two designs. The results showed that both empirical designs had very good overall agreement rates of 0.40 and 0.44, respectively. Moreover, the final gesture sets produced by the two GESs had a 95.4% similarity rate, with only one gesture differing. These findings suggest that VR-GES is a promising alternative to IL-GES for eliciting gesture vocabularies and overcoming the challenges of the initial GES empirical design. The only referent that produced a different gesture after the agreement analysis was turn on bladeless fan. This referent had relatively low agreement rates (IL-GES BF_ON 0.218, VR-GES BF_ON 0.189) compared to the other 21 referents in both GESs, as it attracted a high number of unique gesture proposals. Although participant feedback provided no specific evidence regarding this referent, we found that the second-best gesture in IL-GES was the best gesture in VR-GES. This observation further supports our assertion that VR-GES can produce gesture sets comparable to those obtained from a physically conducted study using a setup resembling the final product.
Further, we observed that participants in VR-GES performed a greater number of unique gestures than in IL-GES. We attribute this to participants’ heightened attention to their virtual hands and their increased engagement with the virtual environment. Participants in the VR-GES expressed more thoughts and comments about the gestures they performed and their virtual hands. Participant feedback shows that they paid close attention to their virtual hands to observe whether they accurately mimicked their real hand gestures. They also seemed to focus on the virtual hands’ color change and were careful to perform the gesture within the given recording time frame. This increased attention to their hands and their performance led participants to consider different types of gestures they could perform with combinations of their hands and fingers. In contrast, participants in the IL-GES appeared distracted by concerns about whether the devices would capture their performed hand gestures. Our analysis of the video recordings from the IL-GES revealed that participants sometimes performed the gestures quickly without paying much attention to finger positions or hand orientation. These findings suggest that participants in the VR-GES paid more attention to their hands and were more engaged and creative in the gesture elicitation process, making it a more suitable method for eliciting novel gestures for emerging technologies, where engagement and unique gesture proposals are required.
One participant raised an important point about the accessibility of our empirical design, particularly regarding our use of a color-change mechanism. While this observation is valid, it is worth noting that our design draws inspiration from the VR-GES design by Perera et al. (2021), which employs a traffic light metaphor, a popular design pattern in user experience (UX) design that uses a color-coded system to convey information to users, similar to the colors of traffic lights. In our design, however, we incorporated ’REC’ (short for recording) letters in the participant’s field of vision (FoV) every time the virtual hands changed colors. This addition aimed to enhance the accessibility of our design. Nevertheless, adding a third indicator could also be a viable alternative design that deserves further exploration.
During our analysis of the two GESs, we noted that some participants were confused by the turn right and turn left commands; for example, for the security camera turn left and turn right referents, they were uncertain whether the commands referred to the device’s right or the user’s right. While this issue can be addressed in an in-lab GES, where the investigator is present to observe and clarify any ambiguities, we did not adopt this approach, to maintain consistency across both studies. In a remote and encapsulated VR GES, however, providing clear referent instructions and conducting pilot studies to test them is crucial. Alternatively, labeling the directions or using a different design approach could also be a helpful addition, providing an opportunity for further investigation.
One significant distinction we observed between IL-GES and VR-GES was their respective methods for recording and analyzing gestures. It is important to capture and view gestures from multiple angles to reduce the likelihood of misclassification. However, IL-GES produces multiple video recordings for a single gesture (in our case, 6 for each). Thus, in terms of designers’ and HCI practitioners’ time and effort, VR-GES is more efficient than IL-GES since it does not require the analysis of multiple videos to classify gestures, a process that can be time-consuming and prone to human error. Moreover, footage from a critical angle may sometimes be corrupted, and some gestures may not be recognizable from another angle. As shown in Fig. 6b and d, it can be challenging to determine the gesture, but the ability to view it from other angles helps. Different gestures may thus require different angles to be easily identifiable, and if the cameras are not positioned correctly, some collected symbols could be unusable.
Based on our study, we concluded that while both designs produced very similar gesture vocabularies, VR-GES produced a higher number of unique and creative gesture proposals and created a better participant experience and engagement. From the investigators’ perspective, the VR-GES design is also useful for collecting reliable data and enabling efficient analysis, reducing potential errors that could occur during investigations and attesting to its efficacy and consistency in practical applications.
Also, our study emphasizes the importance of considering participants’ experiences when selecting and designing empirical designs in HCI studies. Our GES analysis suggests that VR-based gesture elicitation studies may provide a more engaging and mentally stimulating experience for participants, leading to reduced boredom and fatigue and improved task performance. Therefore, we postulate that VR-based gesture elicitation studies can serve as a viable alternative to traditional in-lab methods, especially when seeking to engage a diverse range of participants. Specifically, for designers aiming to create gesture interfaces that cater to global audiences, VR-based design can prove useful in the early stages of system development.
Furthermore, the findings of this study are particularly significant when viewed in light of replication frameworks such as RepliGES (Gheran et al. 2022), which emphasize the importance of reproducibility and flexibility in GESs. VR-GES offers considerable advantages in terms of maintainability and reproducibility, as it allows researchers to easily modify experimental conditions while preserving consistency across studies. This flexibility makes VR-GES a versatile experimental design for HCI researchers and experiment designers, who can leverage the decision matrix provided in Sect. 6 to select the most appropriate design based on available resources and replication goals.
6 Best practices when conducting GESs
Based on our comparison study, we recommend several best practices for gesture interaction designers and HCI practitioners in designing and selecting empirical designs for GESs.
-
When conducting GESs for novel technologies that require unique and creative gesture proposals, using VR-GES design is more suitable. This approach can create a better illusion of system autonomy and ecological validity than IL-GESs. Additionally, VR-GES design is better suited for studying gesture interactions for systems in various settings such as controlling operating rooms (Madapana et al. 2018; O’hara et al. 2013), smart homes (Vogiatzidakis and Koutsabasis 2019; Hoffmann et al. 2019), and vehicles (May et al. 2017; Young et al. 2020).
-
It is important to have representative user samples in the GESs when the gesture interfaces are designed for global audiences. Therefore, for studies that require recruiting a diverse set of participants from different geographies, the VR-GES design is more appropriate. On this point, the original VR-GES design by Perera et al. (2021) shows that experienced and non-experienced participants showed no difference in study completion time, meaning random sampling is feasible.
-
For industrial design projects that require rapid prototyping, the VR-GES design could be more appropriate due to the minimal logistical requirements, being resistant to investigator error, and ease of the analysis phase compared to IL-GESs. For instance, in IL-GESs, setting up and maintaining the study setup for long periods can be challenging in industrial research. Further, we noticed that the cameras need to be backed up and charged after every second participant, which increases the task load of the experimenter in IL-GESs.
-
When developing a VR-GES environment, it is important to consider participant experience and safety. Therefore, it is advisable to reduce motion and have participants maintain a stable position (i.e., either sitting or standing) throughout the experiment. This can help reduce potential discomfort such as dizziness among participants.
-
Including a trial session is crucial for VR-GESs. This allows participants to familiarise themselves with the features and the setup, which can reduce potential confusion and learning effects in remote and encapsulated VR-GESs. Before embedding instructions in a VR-GES, it is best to test them with a pilot study or other pre-evaluation methods. For instance, in our study we noticed how ‘rotate camera’ and ‘turn camera’ instructions could affect participants’ proposals. Further, when instructing participants, it is best to use both auditory and textual instructions and to provide only the minimum required instructions, as proposed in the initial VR-GES design of Perera et al. (2021). It is best not to provide an option to skip the introductory and safety instructions, and important to reduce unnecessary repetition when designing remote VR-GESs.
-
When designing VR-GESs, it is advisable to minimise the tasks participants must perform. In our design, participants only needed to press a button; no other interaction was required in the GES beyond the gesture proposals. This reduces the ‘novelty effect’ resulting from emerging technology and allows participants to focus on the task at hand.
-
Finally, post-questionnaires are more suitable for VR-GESs, while structured interviews may be a better data collection instrument for IL-GESs. In our study, participants tended to give general comments verbally immediately after leaving the lab environment, rather than providing lengthy written feedback in the post-questionnaire.
To assist experiment designers in selecting between IL-GES and VR-GES, we developed a decision matrix. This matrix evaluates each experiment design across several key criteria, with scores assigned on a scale from 1 to 4, where 1 indicates poor suitability and 4 indicates excellent suitability. The weight of each criterion reflects its importance in the decision-making process for the specific use case, and can be adjusted at the researcher’s discretion. To use the matrix, designers input a weight for each criterion; the total scores then indicate which method better aligns with the study’s requirements, guiding the selection of the most appropriate GES design.
We applied weights for the scenario of eliciting gestures for a smart office environment across several criteria, including ecological validity, technology availability, and participant safety. The total weighted scores for IL-GES and VR-GES were 65 and 95, respectively. The higher score for VR-GES indicates that, for this specific use case, VR-GES is more suitable than IL-GES. This conclusion is based on the criteria and weights assigned, which can be adjusted for different research contexts or specific requirements (Table 4).
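A minimal sketch of such a weighted decision matrix follows. The criteria, weights, and 1-4 suitability scores below are hypothetical placeholders rather than the values used in Table 4.

```python
# Illustrative weighted decision matrix: weight * score, summed per design.
# All criteria, weights, and scores here are hypothetical examples.
criteria = {
    # criterion: (weight, IL-GES score, VR-GES score), scores on a 1-4 scale
    "ecological validity":     (5, 4, 3),
    "technology availability": (4, 2, 4),
    "participant safety":      (3, 4, 4),
    "remote participation":    (5, 1, 4),
    "analysis effort":         (4, 2, 4),
}

il_total = sum(w * il for w, il, _ in criteria.values())   # 53
vr_total = sum(w * vr for w, _, vr in criteria.values())   # 79
better = "VR-GES" if vr_total > il_total else "IL-GES"
```

Changing the weights (e.g., prioritizing ecological validity over remote participation) can flip the outcome, which is the intended flexibility of the matrix.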
7 Limitation and future work
While our study provides insights into the use of VR-based gesture elicitation studies as an alternative to traditional in-lab methods, there are several limitations and avenues for future research that should be considered. In this section, we discuss these limitations and propose directions for future research to address them.
The virtual study environment in our study was developed using the ProBuilder plugin in Unity 3D, which provided adequate control over the lighting and graphics and enabled integration with Perera et al.’s (2021) VR-GES and its modularized functionalities, such as built-in hand data extraction and remodeling. However, a dedicated 3D environment development tool might yield a more realistic environment. The graphics of the VR duplicate in our study could therefore be improved to provide a more realistic feel, although the effect of graphics on the ‘presence’ score of the comparison may or may not be significant.
While our study involved a between-group comparison of VR-GES and IL-GES, future studies could also explore within-group comparisons to observe potential differences. However, there are concerns related to bias and learning effects in within-subject GES studies, especially when the aim is to compare the two empirical designs of a study. Therefore this requires further research and consideration in the study design. One possible approach could be to invite a sub-sample of participants after a significant time lapse, such as three months, to minimize the carry-over effects from their previous study participation. Nevertheless, further investigation is needed to ensure the time lapse is sufficient and to examine the effectiveness of this approach.
In addition, we identified a need for more robust and established questionnaires to compare and measure user behaviour between the real world and its digital clones. In an era where digital twins are proliferating and the Metaverse is emerging rapidly, established questionnaires for comparing human behaviour in digital environments against the real world could be valuable. This is another direction that future research could investigate.
Finally, our findings suggest that for behavioural studies such as GESs, VR-GES is a promising alternative to IL-GES, as it provides a more consistent and immersive environment for participants and reduces pragmatic issues. Future research could explore the potential of VR-based studies for other types of behavioural studies in HCI, to overcome some of the common issues identified in Sect. 1 of this work.
8 Conclusion
This paper contributes to the still-scarce literature investigating the effects of using VR-based empirical designs to conduct GESs. This study is the first to compare and validate a GES design that uses VR against the established lab-based design. We investigated the suitability of VR for GESs in comparison to the established in-lab empirical design. We made the in-lab setup equivalent to an in-situ setting through the use of real devices and a real environment, and then conducted two GESs: one in the in-lab setup and one in its digital duplicate in a VR environment. The comparison of the two studies shows that, overall, 95.4% of referents carry similar proposals in both studies, with agreement rates that fall in the same classification bracket. Although IL-GES shows 8.8% higher agreement than VR-GES, VR-GES produced a more diverse set of proposals, making it more suitable for eliciting novel gestures. In addition, the two study designs show no differences in usability, overall task load, felt presence relative to a real environment, or pragmatic qualities. However, VR-GES shows a higher potential for creating the required illusion of autonomy than IL-GES, and it outperformed IL-GES in hedonic stimulation, hedonic identification, and overall attractiveness. Furthermore, the amount and quality of the written qualitative feedback we received were higher in VR-GES. Therefore, we conclude that using VR for GESs outside the laboratory is viable and produces reliable, compatible results compared to the physical analogue, at lower time cost and with the potential to reach much wider, non-local audiences. At a time when digital technologies such as the Metaverse are proliferating and transforming social interactions, our study shows that novel VR-based empirical designs for GESs are a worthwhile and valid methodology to consider when developing gesture interfaces.
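For readers unfamiliar with how the agreement rates summarized above are computed, the standard formulation from Vatavu and Wobbrock (2015), cited in the reference list, can be sketched as follows; the example gesture labels and counts are hypothetical, not data from this study.

```python
from collections import Counter

def agreement_rate(proposals):
    """Agreement rate AR(r) for one referent, per Vatavu & Wobbrock (2015).

    `proposals` is a list of gesture labels, one per participant.
    AR(r) = (|P| / (|P| - 1)) * sum((|Pi| / |P|)**2) - 1 / (|P| - 1),
    where Pi are the groups of identical proposals within P.
    """
    n = len(proposals)
    if n <= 1:
        return 1.0 if n == 1 else 0.0
    groups = Counter(proposals)  # size of each group of identical proposals
    s = sum((size / n) ** 2 for size in groups.values())
    return (n / (n - 1)) * s - 1 / (n - 1)

# Hypothetical referent: 20 participants propose three distinct gestures.
labels = ["swipe"] * 12 + ["point"] * 5 + ["wave"] * 3
print(round(agreement_rate(labels), 3))  # → 0.416
```

Per-referent rates like this, computed separately for the IL-GES and VR-GES samples, are what allow the between-group comparison of agreement and its classification brackets reported above.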
Data availability
The anonymised datasets generated during the current study are available from the corresponding author upon reasonable request.
References
Norman DA, Draper SW (1986) User centered system design: New perspectives on human-computer interaction
Wobbrock JO, Morris MR, Wilson AD (2009) User-defined gestures for surface computing. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1083–1092
Salovaara A, Oulasvirta A, Jacucci G (2017) Evaluation of prototypes and the problem of possible futures. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems 2064–2077
Bargas-Avila JA, Hornbæk K (2011) Old wine in new bottles or novel challenges: a critical analysis of empirical studies of user experience. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems 2689–2698
Koeman L (2020) HCI/UX Research: What Methods do we use?. https://lisakoeman.nl/blog/hci-ux-research-what-methods-do-we-use/
Wobbrock JO, Aung HH, Rothrock B, Myers BA (2005) Maximizing the guessability of symbolic input. In: CHI’05 Extended Abstracts on Human Factors in Computing Systems 1869–1872
Villarreal-Narvaez S, Vanderdonckt J, Vatavu R-D, Wobbrock JO (2020) A systematic review of gesture elicitation studies: What can we learn from 216 studies? In: Proceedings of the 2020 ACM Designing Interactive Systems Conference 855–872
Magrofuoco N, Vanderdonckt J (2019) Gelicit: a cloud platform for distributed gesture elicitation studies. Proceedings of the ACM on Human-Computer Interaction 3(EICS), 1–41
Villarreal-Narvaez S, Sluÿters A, Vanderdonckt J, Vatavu R-D (2024) Brave new GES world: a systematic literature review of gestures and referents in gesture elicitation studies. ACM Comput Surv 56(5):1–55
Morris MR, Wobbrock JO, Wilson AD (2010) Understanding users’ preferences for surface gestures. In: Proceedings of Graphics Interface 2010. 261–268
Ali AX, Morris MR, Wobbrock JO (2019) Crowdlicit: A system for conducting distributed end-user elicitation and identification studies. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems 1–12
Voit A, Mayer S, Schwind V, Henze N (2019) Online, vr, ar, lab, and in-situ: comparison of research methods to evaluate smart artifacts. In: Proceedings of the 2019 Chi Conference on Human Factors in Computing Systems, pp. 1–12
Mottelson A, Hornbæk K (2017) Virtual reality studies outside the laboratory. In: Proceedings of the 23rd Acm Symposium on Virtual Reality Software and Technology 1–10
Kourtesis P, Collina S, Doumas LA, MacPherson SE (2021) An ecologically valid examination of event-based and time-based prospective memory using immersive virtual reality: the effects of delay and task type on everyday prospective memory. Memory 29(4):486–506
Perera M, Gedeon T, Adcock M, Haller A (2021) Towards self-guided remote user studies-feasibility of gesture elicitation using immersive virtual reality. In: 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2576–2583. IEEE
Bellucci A, Zarraonandia T, Díaz P, Aedo I (2021) Welicit: A wizard of oz tool for vr elicitation studies. In: IFIP Conference on Human-Computer Interaction, 82–91. Springer
Henschke M (2020) User behaviour with unguided touchless gestural interfaces. PhD thesis, ANU (Australia)
Nielsen CM, Overgaard M, Pedersen MB, Stage J, Stenild S (2006) It’s worth the hassle! the added value of evaluating the usability of mobile systems in the field. In: Proceedings of the 4th Nordic Conference on Human-computer Interaction: Changing Roles, 272–280
Rogers Y, Connelly K, Tedesco L, Hazlewood W, Kurtz A, Hall RE, Hursey J, Toscos T (2007) Why it’s worth the hassle: The value of in-situ studies when designing ubicomp. In: International Conference on Ubiquitous Computing, 336–353. Springer
Kjeldskov J, Skov MB, Als BS, Høegh RT (2004) Is it worth the hassle? exploring the added value of evaluating the usability of context-aware mobile systems in the field. In: International Conference on Mobile Human-Computer Interaction, 61–73. Springer
Duh HB-L, Tan GC, Chen VH-h (2006) Usability evaluation for mobile device: a comparison of laboratory and field tests. In: Proceedings of the 8th Conference on Human-computer Interaction with Mobile Devices and Services, 181–186
Sun X, May A (2013) A comparison of field-based and lab-based experiments to evaluate user experience of personalised mobile devices. Adv Human-Comput Interact 2013
May KR, Gable TM, Walker BN (2017) Designing an in-vehicle air gesture set using elicitation methods. In: Proceedings of the 9th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, 74–83
Fariman HJ, Alyamani HJ, Kavakli M, Hamey L (2016) Designing a user-defined gesture vocabulary for an in-vehicle climate control system. In: Proceedings of the 28th Australian Conference on Computer-Human Interaction, 391–395
Ruiz J, Li Y, Lank E (2011) User-defined motion gestures for mobile interaction. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 197–206
Kray C, Nesbitt D, Dawson J, Rohs M (2010) User-defined gestures for connecting mobile phones, public displays, and tabletops. In: Proceedings of the 12th International Conference on Human Computer Interaction with Mobile Devices and Services, 239–248
Nebeling M, Teunissen E, Husmann M, Norrie MC (2014) Xdkinect: development framework for cross-device interaction using kinect. In: Proceedings of the 2014 ACM SIGCHI Symposium on Engineering Interactive Computing Systems, 65–74
Vatavu R-D, Zaiti I-A (2014) Leap gestures for tv: insights from an elicitation study. In: Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video, 131–138
Wu H, Wang J, Zhang XL (2016) User-centered gesture development in tv viewing environment. Multimedia Tools Appl 75(2):733–760
Dong H, Danesh A, Figueroa N, El Saddik A (2015) An elicitation study on gesture preferences and memorability toward a practical hand-gesture vocabulary for smart televisions. IEEE access 3:543–555
Vogiatzidakis P, Koutsabasis P (2019) Frame-based elicitation of mid-air gestures for a smart home device ecosystem. In: Informatics 6: 23. MDPI
Hoffmann F, Tyroller M-I, Wende F, Henze N (2019) User-defined interaction for smart homes: voice, touch, or mid-air gestures? In: Proceedings of the 18th International Conference on Mobile and Ubiquitous Multimedia, 1–7
Choi E, Kwon S, Lee D, Lee H, Chung MK (2014) Towards successful user interaction with systems: focusing on user-derived gestures for smart home systems. Appl Ergon 45(4):1196–1207
Vogiatzidakis P, Koutsabasis P (2022) ‘Address and command’: two-handed mid-air interactions with multiple home devices. Int J Human-Comput Stud 159:102755
Gheran B-F, Vanderdonckt J, Vatavu R-D (2018) Gestures for smart rings: Empirical results, insights, and design implications. In: Proceedings of the 2018 Designing Interactive Systems Conference, 623–635
Williams AS, Garcia J, Ortega F (2020) Understanding multimodal user gesture and speech behavior for object manipulation in augmented reality using elicitation. IEEE Trans Visual Comput Gr 26(12):3479–3489
Piumsomboon T, Clark A, Billinghurst M, Cockburn A (2013) User-defined gestures for augmented reality. In: IFIP Conference on Human-Computer Interaction, 282–299. Springer
Pham T, Vermeulen J, Tang A, MacDonald Vermeulen L (2018) Scale impacts elicited gestures for manipulating holograms: Implications for ar gesture design. In: Proceedings of the 2018 Designing Interactive Systems Conference, 227–240
Döring T, Kern D, Marshall P, Pfeiffer M, Schöning J, Gruhn V, Schmidt A (2011) Gestural interaction on the steering wheel: reducing the visual demand. In: Proceedings of the Sigchi Conference on Human Factors in Computing Systems, 483–492
Hollan J, Hutchins E, Kirsh D (2000) Distributed cognition: toward a new foundation for human-computer interaction research. ACM Trans Comput-Human Interact (TOCHI) 7(2):174–196
Kühnel C, Westermann T, Hemmert F, Kratz S, Müller A, Möller S (2011) I’m home: defining and evaluating a gesture set for smart-home control. Int J Human-Comput Stud 69(11):693–704
Madapana N, Wachs J (2019) Database of gesture attributes: Zero shot learning for gesture recognition. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 1–8. IEEE
Morris MR, Danielescu A, Drucker S, Fisher D, Lee B, Schraefel M, Wobbrock JO (2014) Reducing legacy bias in gesture elicitation studies. Interactions 21(3)
Chamunorwa M, Wozniak MP, Krämer S, Müller H, Boll S (2023) An empirical comparison of moderated and unmoderated gesture elicitation studies on soft surfaces and objects for smart home control. Proceedings of the ACM on human-computer interaction 7(MHCI), 1–24
Gaver B, Dunne T, Pacenti E (1999) Design: cultural probes. Interactions 6(1):21–29
Hutchinson H, Mackay W, Westerlund B, Bederson BB, Druin A, Plaisant C, Beaudouin-Lafon M, Conversy S, Evans H, Hansen H (2003) Technology probes: inspiring design for and with families. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 17–24
Ratcliffe J, Soave F, Bryan-Kinns N, Tokarchuk L, Farkhatdinov I (2021) Extended reality (xr) remote research: a survey of drawbacks and opportunities. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–13
Ratcliffe J, Soave F, Hoover M, Ortega FR, Bryan-Kinns N, Tokarchuk L, Farkhatdinov I (2021) Remote xr studies: Exploring three key challenges of remote xr experimentation. In: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 1–4
Ma X, Cackett M, Park L, Chien E, Naaman M (2018) Web-based vr experiments powered by the crowd. In: Proceedings of the 2018 World Wide Web Conference, 33–43
Steed A, Frlston S, Lopez MM, Drummond J, Pan Y, Swapp D (2016) An ‘in the wild’ experiment on presence and embodiment using consumer virtual reality equipment. IEEE Trans Visual Comput Gr 22(4):1406–1414
Silva FV, Mattos Brito Oliveira FC, Moraes Alves R, Castro Quintinho G (2022) Gesture elicitation for augmented reality environments. In: International Conference on Human-Computer Interaction, 143–159. Springer
Perera M, Gedeon T, Haller A, Adcock M (2023) Using virtual reality to overcome legacy bias in remote gesture elicitation studies. In: International Conference on Human-Computer Interaction. Springer
Saffo D, Yildirim C, Di Bartolomeo S, Dunne C (2020) Crowdsourcing virtual reality experiments using vrchat. In: Extended Abstracts of the 2020 Chi Conference on Human Factors in Computing Systems, 1–8
Koutsabasis P, Vogiatzidakis P (2019) Empirical research in mid-air interaction: a systematic review. Int J Human-Comput Interact 35(18):1747–1768
Dandurand F, Shultz TR, Onishi KH (2008) Comparing online and lab methods in a problem-solving experiment. Behav Res Methods 40(2):428–434
Clifford S, Jerit J (2014) Is there a cost to convenience? an experimental comparison of data quality in laboratory and online studies. J Exp Polit Sci 1(2):120–131
Germine L, Nakayama K, Duchaine BC, Chabris CF, Chatterjee G, Wilmer JB (2012) Is the web as good as the lab? comparable performance from web and lab in cognitive/perceptual experiments. Psychon Bull Rev 19(5):847–857
Andreasen MS, Nielsen HV, Schrøder SO, Stage J (2007) What happened to remote usability testing? an empirical study of three methods. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1405–1414
Sharp H, Rogers Y, Preece J (2007) Interaction Design: beyond Human-computer Interaction. Wiley
Gustarini M, Ickin S, Wac K (2013) Evaluation of challenges in human subject studies "in-the-wild" using subjects' personal smartphones. In: Proceedings of the 2013 ACM Conference on Pervasive and Ubiquitous Computing Adjunct Publication, 1447–1456
Nielsen M, Störring M, Moeslund TB, Granum E (2003) A procedure for developing intuitive and ergonomic gesture interfaces for HCI. In: International Gesture Workshop. Springer
Rovelo Ruiz GA, Vanacken D, Luyten K, Abad F, Camahort E (2014) Multi-viewer gesture-based interaction for omni-directional video. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 4077–4086
Troiano GM, Pedersen EW, Hornbæk K (2014) User-defined gestures for elastic, deformable displays. In: Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces, 1–8
Leng HY, Norowi NM, Jantan AH (2017) A user-defined gesture set for music interaction in immersive virtual environment. In: Proceedings of the 3rd International Conference on Human-Computer Interaction and User Experience in Indonesia, 44–51
Masai K, Kunze K, Sakamoto D, Sugiura Y, Sugimoto M (2020) Face commands-user-defined facial gestures for smart glasses. In: 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 374–386. IEEE
Bader P, Le HV, Strotzer J, Henze N (2017) Exploring interactions with smart windows for sunlight control. In: Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 2373–2380
Silpasuwanchai C, Ren X (2015) Designing concurrent full-body gestures for intense gameplay. Int J Human-Comput Stud 80:1–13
Dim NK, Silpasuwanchai C, Sarcar S, Ren X (2016) Designing mid-air tv gestures for blind people using user-and choice-based elicitation approaches. In: Proceedings of the 2016 ACM Conference on Designing Interactive Systems, 204–214
Cafaro F, Lyons L, Kang R, Radinsky J, Roberts J, Vogt K (2014) Framed guessability: using embodied allegories to increase user agreement on gesture sets. In: Proceedings of the 8th International Conference on Tangible, Embedded and Embodied Interaction, 197–204
Cafaro F, Lyons L, Antle AN (2018) Framed guessability: Improving the discoverability of gestures and body movements for full-body interaction. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–12
Ali AX, Morris MR, Wobbrock JO (2021) "I am iron man": priming improves the learnability and memorability of user-elicited gestures. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–14
Gero KI, Chilton L, Melancon C, Cleron M (2022) Eliciting gestures for novel note-taking interactions. In: Proceedings of the 2022 ACM Designing Interactive Systems Conference, 966–975
Ch NAN, Tosca D, Crump T, Ansah A, Kun A, Shaer O (2022) Gesture and voice commands to interact with ar windshield display in automated vehicle: a remote elicitation study. In: Proceedings of the 14th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, 171–182
Lee S-S, Chae J, Kim H, Lim Y-k, Lee K-p (2013) Towards more natural digital content manipulation via user freehand gestural interaction in a living room. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 617–626
Dennis BK (2014) Understanding participant experiences: reflections of a novice research participant. Int J Qualitative Methods 13(1):395–410
Field A, Hole G (2002) How to Design and Report Experiments. Sage
Lazar J, Feng JH, Hochheiser H (2017) Research Methods in Human-computer Interaction. Morgan Kaufmann
Fuller A, Fan Z, Day C, Barlow C (2020) Digital twin: enabling technologies, challenges and open research. IEEE access 8:108952–108971
Hassenzahl M, Burmester M, Koller F (2003) Attrakdiff: Ein fragebogen zur messung wahrgenommener hedonischer und pragmatischer qualität. In: Mensch & Computer 2003, 187–196. Springer
Brooke J (1996) Sus-a quick and dirty usability scale. Usability Eval Ind 189(194):4–7
Hart SG (2006) Nasa-task load index (nasa-tlx); 20 years later. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 50, pp. 904–908. Sage publications Sage CA: Los Angeles, CA
Hart SG, Staveland LE (1988) Development of nasa-tlx (task load index): Results of empirical and theoretical research. In: Advances in Psychology 52: 139–183. Elsevier
Banakou D, Groten R, Slater M (2013) Illusory ownership of a virtual child body causes overestimation of object sizes and implicit attitude changes. Proc Natl Acad Sci 110(31):12846–12851
Kilteni K, Maselli A, Kording KP, Slater M (2015) Over my fake body: body ownership illusions for studying the multisensory basis of own-body perception. Front Human Neurosci 9:141
Kittur A, Chi EH, Suh B (2008) Crowdsourcing user studies with mechanical turk. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 453–456
Vatavu R-D, Wobbrock JO (2015) Formalizing agreement analysis for elicitation studies: new measures, significance test, and toolkit. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1325–1334
Pouliquen-Lardy L, Milleville-Pennel I, Guillaume F, Mars F (2016) Remote collaboration in virtual reality: asymmetrical effects of task distribution on spatial processing and mental workload. Virt Real 20:213–220
Holzwarth V, Schneider J, Handali J, Gisler J, Hirt C, Kunz A, Brocke J (2021) Towards estimating affective states in virtual reality based on behavioral data. Virt Real 1–14
Lu F, Yu D, Liang H-N, Chen W, Papangelis K, Ali NM (2018) Evaluating engagement level and analytical support of interactive visualizations in virtual reality environments. In: 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 143–152. IEEE
Witmer BG, Singer MJ (1998) Measuring presence in virtual environments: a presence questionnaire. Presence 7(3):225–240
Slater M, Khanna P, Mortensen J, Yu I (2009) Visual realism enhances realistic response in an immersive virtual environment. IEEE Comput Gr Appl 29(3):76–84
Gheran B-F, Villarreal-Narvaez S, Vatavu R-D, Vanderdonckt J (2022) Repliges and gestory: visual tools for systematizing and consolidating knowledge on user-defined gestures. In: Proceedings of the 2022 International Conference on Advanced Visual Interfaces, 1–9
Madapana N, Gonzalez G, Rodgers R, Zhang L, Wachs JP (2018) Gestures for picture archiving and communication systems (pacs) operation in the operating room: Is there any standard? PLoS One 13(6):e0198092
O'Hara K, Harper R, Mentis H, Sellen A, Taylor A (2013) On the naturalness of touchless: putting the "interaction" back into NUI. ACM Trans Comput-Human Interact (TOCHI) 20(1):1–25
Young G, Milne H, Griffiths D, Padfield E, Blenkinsopp R, Georgiou O (2020) Designing mid-air haptic gesture controlled user interfaces for cars. Proc ACM Human-Comput Interact 4(EICS), 1–23
Acknowledgements
We would like to express our gratitude to all those who contributed to this research. We thank the Commonwealth Scientific and Industrial Research Organisation (CSIRO) and the Australian National University (ANU) for providing access to the facilities, equipment, and other resources essential to the completion of this work. We are deeply grateful to the study participants, without whom this research would not have been possible; we thank them for their time, effort, and willingness to share their experiences and perspectives, which enriched our understanding of the topic. We also thank Zhangcheng Qiang for his assistance in setting up and conducting the user studies, and we acknowledge the valuable contributions of everyone involved in this work.
Funding
Open access funding provided by CSIRO Library Services.
Author information
Authors and Affiliations
Contributions
Madhawa Perera—Contributed to developing the conceptual framework, undertook VR development, designed the user studies, conducted the user study, data analysis, and authored the manuscript. Tom Gedeon—Contributed to refining and enhancing the conceptual framework, designing the user study, guiding data analysis, and providing input into manuscript editing. Matt Adcock— Contributed to refining and enhancing the conceptual framework, designing the user study, guiding data analysis, and providing input into manuscript editing. Armin Haller—Contributed to refining and enhancing the conceptual framework, designing the user study, and providing input into manuscript editing.
Corresponding author
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
Ethical approval
The study was approved by the ethics committee of the Australian National University (ANU) (ethics protocol 2020/012) and was conducted according to the ANU ethics guidelines. The ANU Ethics Committee reviewed and approved the study protocol to ensure that it met the ethical standards for research involving human subjects. All participants were provided with information regarding the aims and procedures of the study, and their informed consent was obtained before the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Perera, M., Gedeon, T., Adcock, M. et al. VR & in-lab GESs: an analysis of empirical research designs for gesture elicitation in in-lab and virtual reality settings. Virtual Reality 29, 59 (2025). https://doi.org/10.1007/s10055-025-01098-0
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s10055-025-01098-0