1 Introduction

Borders and airports are becoming increasingly congested as more people travel. In this environment, border guards must make rapid decisions despite myriad distractions, pressures, and limited vigilance. Credibility assessment, a vital task in securing borders and other points of entry into a country, is generally carried out manually by a border security or law enforcement officer, but this task is hindered by human bias. A technology-based solution to support credibility assessment has been investigated, leading to the AVATAR [7], an embodied conversational agent (ECA) with integrated sensors that conducts automated credibility assessments of travelers. The AVATAR is implemented as a self-service kiosk with an integrated screen on which the ECA, represented as a law enforcement officer, interviews passengers while integrated sensors measure the nonverbal, verbal, and physiological behaviors of the interviewees. An important aspect of the AVATAR is the human, intelligent appearance and demeanor its ECA presents to interviewees. Great attention to detail has been placed on the ECA’s demeanor, voice and inflection, clothing, language, and appearance, and their effect on the perceptions and behaviors of interviewees.

With the improvements in Artificial Intelligence (AI) for predictive modeling and the accessibility of programmable robots, the next logical step for this research in credibility assessment is to make the interaction more human and intimate.

The transition from humans seeing robots in science fiction films to having robots work alongside and assist them in day-to-day tasks has been swift. George Devol, an American inventor from Kentucky, designed the first industrial robot, Unimate, in the 1950s, but we have come a long way since then. Mindful of the ethics involved with AI [17] and guided by Asimov’s Laws of Robotics, companies are now developing cutting-edge robots that can perform various tasks not just faster but better than humans, and some of these robots are made to look like humans. SoftBank Robotics is one such company; its programmable robots, which we are using in our research, are pictured in Fig. 1.

Fig. 1. Programmable humanoid robots (Source: SoftBank Robotics)

These humanoid robots are capable of realistic dialogue and nonverbal gesturing. Most of their initial applications are confined to customer service tasks such as greeting visitors and providing product and service descriptions. Various industrial sectors, including automotive, aerospace, construction, and defense, have adopted robots to make life easier for the humans around them. In contrast, this research aims to investigate richer interactions that rely on interpersonal communication and credibility assessment theories to study the emerging area of human-robot interaction.

In the planned study, participants will be interviewed by both the AVATAR and a programmable humanoid robot. The robot will be augmented with the additional behavioral and physiological sensors essential for detecting deception cues, including a microphone, a video camera, and an eye tracker. The robot will iterate through approximately 14 questions, a few of which are used to establish a baseline for the deception model. Throughout the interview, the robot exhibits interactive cues such as eye blinking and hand gestures to convey each question intuitively.
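To illustrate how such an interview loop might be scripted, the following minimal sketch uses the SoftBank NAOqi Python SDK; the robot address, port, question list, and animation name are illustrative assumptions rather than the study's actual configuration.

# Minimal sketch of a robot-led interview loop (assumptions: NAOqi Python SDK
# installed, a hypothetical robot address, placeholder questions and animation).
from naoqi import ALProxy

ROBOT_IP = "192.168.1.10"  # hypothetical robot address
PORT = 9559                # default NAOqi port

# ALAnimatedSpeech speaks text while playing matching gestures, providing the
# interactive cues (blinking, hand gestures) described above.
speech = ALProxy("ALAnimatedSpeech", ROBOT_IP, PORT)

baseline_questions = [
    "Please state your full name.",
    "Where did you travel from today?",
]
target_questions = [
    "Are you carrying any prohibited items?",
]

for question in baseline_questions + target_questions:
    # The ^start tag requests a gesture animation to accompany the spoken question.
    speech.say("^start(animations/Stand/Gestures/Explain_1) " + question)
    # External sensors (microphone, camera, eye tracker) would record the
    # interviewee's response here before the next question is asked.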

2 Literature Survey

As automation becomes increasingly prevalent in the service and manufacturing industries, additional applications such as automated law enforcement and border control technologies are being explored. Among the many possible applications of automation to law enforcement, automated interviewing and credibility assessment using a humanoid robot interviewer are the focus of this study and the related literature.

2.1 The Need for Better Credibility Assessment in Law Enforcement

In the law enforcement realm, determining whether someone is telling the truth is a critical step in any investigative effort. Whether a person is trying to cross a border or give a witness statement, what they say and the way their words are interpreted are crucial to the next investigative steps. Despite our belief in our ability to detect deception, novice human interviewers correctly determine whether an interviewee is deceiving only 54% of the time, correctly detecting deception 47% of the time and correctly classifying truths as nondeceptive 61% of the time [2]. The higher-than-chance accuracy for truth detection reflects an intrinsic truth bias: novice interviewers classify most statements as true. Conversely, experts and law enforcement officers often reverse these accuracies and demonstrate a lie bias, assuming most statements are deceptive. Although some law enforcement personnel receive training in deception detection, a 2004 study by Garrido, Masip, and Herrero [13] found that law enforcement officers’ credibility assessments result in the same near-chance accuracy levels.

The polygraph is the most well-known method for determining veracity in the law enforcement setting. Administering a basic polygraph test requires physically attaching sensors to several parts of the participant’s body to measure cardiovascular, respiratory, and electrodermal activity [16]. Attaching these sensors is an invasive and time-consuming process that can itself make a non-deceiver uncomfortable and elicit responses that influence polygraph measurements [19]. Beyond being invasive and time-consuming, the polygraph remains highly polarizing in modern scientific opinion: none of the dominant methodologies for analyzing polygraph sensor data have achieved an acceptable degree of empirical support when investigating crimes in progress [16].

2.2 AVATAR for Credibility Assessment

In addition to detecting deception only at chance levels, human interviewers introduce individual variability into each interview they conduct. A human interviewer has a personal interview style as well as fluctuations in disposition, both of which can influence how an interviewee answers questions and behaves [4]. To eliminate interviewer variance, Nunamaker et al. [7] proposed the creation of an Embodied Conversational Agent (ECA) running on a kiosk that would conduct interviews while collecting sensor data about the interviewees. Figure 2 illustrates the current prototype of the AVATAR system. This original ECA kiosk design was later field tested by Elkins, Derrick, and Gariup in 2012 [10]. The AVATAR can administer interviews consistently, removing the behavioral variance caused by a human interviewer, by asking interview questions in a systematic and controlled way while measuring behavioral and physiological reactions.

Fig. 2. AVATAR kiosk for automated credibility assessment

In the 2012 experiment, human participants passed through a mock visa checkpoint by completing an interview with the AVATAR. Some participants holding fake visas were instructed to lie during the interview. Using a microphone and an eye-tracking sensor, the AVATAR achieved a 94% overall detection accuracy rate for identifying imposters; all imposters were correctly categorized, with only two false positives [10].

According to Interpersonal Deception Theory, first proposed by Buller and Burgoon in 1996, deception is a strategic interaction between sender and receiver. During a deception, the deceiver must manage additional cognitive demands, which results in leakage of deception cues [3]. As summarized by Elkins et al., deceivers speak with a greater and more varied vocal pitch, shorter durations, less fluency, and greater response latencies [4]. Deceivers also exhibit ocular cues that can indicate arousal from deception and plausibility management: differences in blink duration, blink frequency, pupil dilation, and eye-gaze fixation can all be indicative of an attempt at deceit [4]. The camera, eye tracker, and microphone sensor data from the AVATAR are fused and reviewed by a classification engine, which benefits from not relying on a single indicator of deception [9].
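As a concrete illustration of feature-level fusion, the sketch below concatenates vocal and ocular features and scores them jointly with a single classifier, so no individual cue decides the outcome. This is an illustrative assumption only, not the AVATAR's actual classification engine; the features, data, and classifier choice are synthetic placeholders.

# Illustrative sketch of feature-level sensor fusion with synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_responses = 200

# Hypothetical per-response features from each sensor stream.
vocal = rng.normal(size=(n_responses, 3))   # e.g., pitch variance, response latency, fluency
ocular = rng.normal(size=(n_responses, 3))  # e.g., blink rate, pupil dilation, gaze fixation
labels = rng.integers(0, 2, size=n_responses)  # 1 = deceptive response, 0 = truthful

# Fuse modalities by concatenation so the classifier weighs vocal and ocular
# cues jointly rather than relying on a single indicator of deception.
fused = np.hstack([vocal, ocular])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(fused[:150], labels[:150])
print("held-out accuracy:", clf.score(fused[150:], labels[150:]))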

The AVATAR kiosk can administer consistent interviews and make quick, accurate decisions about the veracity of responses to a set of predefined questions. The use of non-invasive sensors, such as the microphone and eye-tracking camera, offers a way to collect physiological data from interview participants without introducing variation due to discomfort from physical sensors.

3 Study Objectives

This research will help us understand how people perceive interactions with a robot agent in contrast to the more traditional ECA agents confined to a screen. We also intend to measure the perception-based changes a person undergoes during the interaction with the robot.

3.1 Objectives

  • Find out which interaction (ECA or humanoid robot) elicits the most diagnostic behavioral cues for detecting deception

  • Study whether the physical presence of a robot affects perceptions of the interaction and of the speaking partner, such as likability, expertise, or dominance

  • Compare the deception detection accuracy of the AVATAR and the robot interviewer

  • Understand the societal and personal implications of human-robotic interactions

  • Determine how to surreptitiously incorporate and calibrate multiple behavioral sensors (e.g., eye tracker) into a humanoid robot form factor and interview environment.

4 Conceptual Background of Research

An important aspect of the AVATAR is the use of an ECA, represented by a face on the kiosk screen, that asks the interview questions. Having a face to speak to creates a more interactive social context for the deception to take place, with access to visual, auditory, verbal, and environmental channels. According to Buller and Burgoon, a less interactive social context limits the cues that would be produced [3]. This paper proposes a comparison between the current animated-face ECA and a physically present humanoid robot as the interviewer, to see how a humanoid robot changes the presence and robustness of the deception cues elicited from the interviewee.

There is evidence that people view humanoid robots in positions of authority as less authoritative and credible. A study by Edwards, Edwards, and Omilion-Hodges conducted simulated medical interviews with a human physician and a robot physician using a SoftBank-supplied NAO robot. Post-interview survey results showed that the human physician scored higher on positive affect, perceived credibility, and social presence [8].

Wood et al. found that children interviewed by humanoid robots interacted very similarly with a robot interviewer as with a human interviewer. There were, however, differences in the duration of robot interviews, in eye-gaze patterns toward robot interviewers, and in the response times of robot interviewers [5]. If this alternative eye-gaze behavior applies broadly to how people look at humanoid robots, there is a concern that our humanoid robot may induce eye behavior that confounds the AVATAR classification engine.

Alternatively, the robot’s humanoid appearance could produce stronger responses from interview participants. Latikka, Turja, and Oksanen found that, when different robot types were tested in an elderly care facility to assist human workers, humanoid robots gained faster acceptance by human staff and were associated with higher reported self-efficacy [15]. Self-efficacy in this scenario refers to a worker’s beliefs about their ability to use robots effectively. The fact that humanoid robots performed better than non-humanoid robots on this measure suggests a familiarity with humanoid robots that could translate into a more interactive social context for our deception experiment. In this richer context, more open communication channels could lead to more visible deception cues [3].

5 Experimental Design

The most challenging aspect of conducting research into nonverbal deception detection, regardless of the interviewer, is establishing internal validity. Obtaining ground truth, that is, clear and unambiguous behavioral examples of deception, is challenging. Generally, credibility assessment research focuses on three experimental profiles of increasing ecological validity, traded against experimental control and the production of clear, unambiguous lie examples: deceptive interviews, mock crimes, and field experiments.

5.1 Deceptive Interviews

In this paradigm (example studies: [11, 12]), participants are recruited for a study ostensibly about, for example, job interviewing strategies or student intelligence. When the participants arrive at the lab, they are instructed to complete an interview, with the primary task of appearing as credible as possible to the interviewer (human, AVATAR, or humanoid robot). Participants are incentivized monetarily to succeed and are often given an experimental prime, such as tying their performance to their identity or self-efficacy.

During the interview, participants are asked a wide variety of question types, such as biographical questions, ethical dilemmas, and questions requiring short or long form answers. Participants are randomly instructed to lie or tell the truth for each question via a teleprompter that is obscured from the interviewer. After each set of questions, the interviewee reports how credible they were in their answers and their overall confidence in their performance. After completing the interview, participants complete a post-survey in which they report their perceptions, beliefs, and feelings during the interview.
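A minimal sketch of how the randomized lie/truth instructions could be generated for the teleprompter appears below; the question list, fixed seed, and uniform truth/lie split are illustrative assumptions rather than the study's actual protocol.

# Illustrative sketch: assign a random veracity instruction to each question.
import random

questions = [
    "Describe your most recent job.",
    "What would you do if a coworker asked you to cover up a mistake?",
    "Summarize your activities last weekend.",
]

random.seed(42)  # fixed seed so the generated script can be reproduced
# Each question independently receives an instruction shown to the participant
# on a teleprompter that the interviewer cannot see.
script = [(question, random.choice(["TRUTH", "LIE"])) for question in questions]

for question, instruction in script:
    print(f"{instruction}: {question}")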

The strength of this experimental design is its high level of control and the many clear examples of truthful and deceptive behavior it generates. Additionally, because interviewees report how credible they were, deception can be treated as a continuous variable rather than a binary truth or lie; there is a difference between a white lie and a complete fabrication. The ability to create a diverse pool of question responses also improves the models built to classify deceptive behavior.

The primary weakness of this design is that participants feel no jeopardy or concern, making the measured behaviors partially incongruent with a high-stakes security environment. How detrimental this weakness is to the research is often overstated, as jeopardy only specifically activates arousal and physiological cues, while there are many other categories of cues expected from liars, such as cognitive load, memory, emotion, behavioral control, and communication strategies.

5.2 Mock Crime

In contrast to the previous design, the mock crime experiment increases ecological validity and perceived participant jeopardy by instructing participants to commit a mock crime and subsequently complete an investigative interview (example studies: [6, 7, 18]). Many scenarios have been evaluated, including impostership (participants take on a fake identity), smuggling drugs or a bomb, self-selecting to cheat on an exam, and stealing a valuable object.

After committing the mock crime, participants complete a security interview. For example, in a bomb smuggling study, participants would attempt to conceal a bomb hidden in their bag and credibly complete a border security interview inquiring about their travel plans, destination, identity, the contents of their bag, and other customs-related matters.

The primary strength of this design is its relevance to how automated credibility systems would be used. Moreover, there is no universal cue common to all lies. To develop a reliable system for deception detection, the context, questions, and situation must each be modeled uniquely from data. A deception detection model based on border security interviews would not necessarily function if applied to a criminal investigation or a fraud interview.

The major downside of this approach is reduced experimental control and the collection of far fewer examples of deceptive behavior. Often, 80% of the questions asked are irrelevant and only 20% require the participant to lie (e.g., "Are you carrying a prohibited object?"). Despite its limitations, this experimental design is the most common in the field.

5.3 Field Experiment

After the validity of the technology is established in the lab, the next step is to take it to the field: airports, border crossings, and law enforcement facilities. In these locations, ecological validity is highest, and participants can be recruited from the local population (e.g., passengers disembarking from international flights). In this scenario, participants can be given instructions (an experimental manipulation) or simply asked to complete an interview naturally.

This type of study is typical in the later stages of research and offers valuable insight into how the technology is perceived by its actual future users. The major downside of this type of study is reduced experimental control; reliable lie examples, or ground truth, can be impossible to collect without experimental manipulation.

6 Discussion

For this research, we will start with the more controlled scenario of a deceptive interview. This will allow a more direct comparison between deceptive interviews conducted by the AVATAR and by its humanoid robot counterpart.

The implications of this research depend on whether the humanoid robot interviewer is found to be more effective, less effective, or not significantly different from the AVATAR at eliciting deception cues. The implications of each scenario are discussed separately below.

6.1 Scenario: Humanoid Robot Outperforms AVATAR

If the humanoid robot performs better at eliciting deception cues from interviewees, then transitioning to humanoid robot interviewers must be given very serious consideration. This would introduce a new set of research questions about deploying a humanoid robot in the field, such as sensor placement, power requirements, and interview positioning (e.g., seated or standing).

In the job market there is a need to determine the veracity of applicants for certain high-stakes positions, but in recent years controversy has surrounded and limited the use of the most popular credibility assessment for job screening – the polygraph [14]. If a robot equipped with the AVATAR sensor suite could perform quick, noninvasive, and reliable job screening then employers could hire with more confidence in their candidates.

The use of the polygraph for credibility assessment in general law enforcement has come under similar scrutiny to job screening. According to Ben-Shakhar, Bar-Hillel, and Lieblich, many of the studies supporting the effectiveness of the polygraph were found to have several types of methodological contamination [1]. A robot equipped with AVATAR sensors could administer interviews for determining veracity and possibly have those determinations submitted as evidence in the courtroom someday.

Future research on the successful pairing of the AVATAR with humanoid robots could incorporate different voices and appearances. Interpersonal Deception Theory specifies that the sender’s (deceiver’s) communications are affected by the communications of the receiver (the robot). Changes to the robot’s demeanor, or costuming the robot in a uniform, could communicate perceived authority and possibly elicit stronger deception cues from the deceiver.

6.2 Scenario: Robot Underperforms Against AVATAR

If the robot performs worse than the AVATAR, then analysis of participants’ perceptions of the robot and AVATAR interviewers will be critical. Survey questions given to study participants after their interviews should indicate which potential moderators or mediators affected the elicitation of cues to deception. Further studies on different robot heights, voices, dispositions, and other interview variations should still be conducted.

Regardless of how well a humanoid robot interviewer performs in the task of credibility assessment, this research is only the beginning of the investigation into human-robot interaction and socialization, as humanoid robots continue to become more commonplace in society as social actors and grow in their AI and emotional intelligence.