1 Introduction

Deception occurs frequently in daily life and can cause severe consequences and losses to individuals and society. Automatic deception detection methods for multi-turn QA can benefit many applications, such as criminal interrogation, court depositions, interviews, and online marketplaces. However, text-based contextual deception detection has not been explored sufficiently [29], mainly due to the lack of proper datasets and the difficulty of finding deceptive signals. To alleviate this problem, we focus on deception detection in multi-turn QA, which aims to classify each QA pair as deceptive or not through analysis of the context.

Table 1. Part of a multi-turn QA example in the dataset. \(Q_i\) denotes the \(i\)-th question and \(A_i\) the \(i\)-th answer. “T” means the QA pair is truthful and “F” means it is deceptive.

Existing deception detection methods rely heavily on hand-crafted features, including verbal [7, 11, 19, 21, 22, 27, 31, 32] and non-verbal [6, 18, 25, 28] cues drawn from different modalities. They ignore the semantic information implied in contexts and cannot be applied to multi-turn QA data. Tasks such as dialogue systems [15] and multi-turn question answering [14, 30] appear similar to ours. However, they are not classification tasks and thus cannot be applied directly to our task, which is formulated as sentence-pair classification. It is therefore necessary to propose a novel approach for recognizing deceptive QA pairs.

Intuitively, the information implied in contexts is needed to understand the subjective beliefs of a speaker, which is an essential step in detecting deceit [13]. For example, we cannot judge which QA pairs in Table 1 are deceptive without the given contexts. Furthermore, the features of deception are implicit and difficult to detect. Due to the sparsity and complexity of latent deceptive signals, treating all of the context information equally degrades model performance. As shown in Table 1, Turn-5 is relatively less relevant to Turn-2, while Turn-1, 3, and 4 are closely related to it. Taking all of the contexts into account would probably hurt the model’s ability to recognize deception.

We propose two hypotheses: (1) QA context is conducive to detecting deceit. (2) Noise implied in the QA context hinders the accurate identification of deception. To test these two hypotheses, we use BERT [5] to obtain context-independent sentence embeddings and a BiGRU [3] to obtain context-aware sentence embeddings. More importantly, a novel context selector is proposed to filter out noise in the contexts. Due to the lack of a proper dataset, we construct a multi-turn QA dataset containing sequentially dependent QA pairs for the experiments. We design different questionnaires covering six daily-life topics to collect deceptive and non-deceptive data. Our contributions are:

(1) We make the first attempt to tackle the multi-turn QA-style deception detection problem, and design a novel Context Selection Network (CSN) to effectively explore deceptive signals implied in the contexts.

(2) To fill the gap in deception detection for multi-turn QA, a newly collected dataset, Deception QA, is presented for the target task.

(3) Compared with several deep learning-based baselines, our model achieves the best performance on the collected dataset, showing its effectiveness.

2 Related Work

2.1 Deception Detection

To address the problem of automatic deception detection, researchers have carried out a series of studies in different scenarios, such as social networks and daily life.

Social network-based deception detection has long been studied in the research community. Most such methods exploit the propagation pattern of deceptive information [2] and interactions between multiple users [17] to detect deception. However, these features do not exist in multi-turn QA under daily-life situations, so these methods cannot be applied directly to this new kind of task.

Deception also often occurs in daily life. Researchers have analyzed the features that can be used to detect it, which can be classified into linguistic features and interactions between individuals.

Linguistic Features: Some studies have shown the effectiveness of features derived from text analysis, including basic linguistic representations such as n-grams and Linguistic Inquiry and Word Count (LIWC) [19, 22], as well as more complex linguistic features derived from syntactic CFG trees and part-of-speech tags [7, 31]. Based on these findings, many studies have focused on text-based methods, recognizing deceptive language in games [1, 27], online reviews [20], news articles [26], and interviews [12].

Interactions Between Individuals: Apart from linguistic features implied in texts, interactions between individuals can also benefit deceit detection. Tsunomori et al. [29] examined the effect of question types and individuals’ behaviors on participants. Their findings show that specific questions led to more salient deceptive behavior patterns in participants, which resulted in better deception detection performance.

These studies show that linguistic features and interactions between individuals in contexts contribute to deception detection, so deception detection in text-based multi-turn QA is significant and reasonable. Although deceptive behavior often occurs in multi-turn QA in daily life, no work has drawn cues of deception from text-based QA contexts, owing to the difficulty of finding deceptive signals and of collecting and annotating deception data. Unlike all prior studies, this paper focuses on a novel task: deception detection in multi-turn QA. To the best of our knowledge, our work is the first attempt to perform deception detection in this setting.

2.2 Datasets Comparison

A few datasets based on different modalities have been developed for deception detection, including text-based [19, 21, 27, 32], audio-based [9, 11], and multimodal [24, 25, 28] datasets.

Some researchers have proposed text-based datasets for deception detection. Ott et al. [21] developed the Ott Deceptive Opinion Spam corpus, which consists of 800 true reviews and 800 deceptive reviews. Mihalcea et al. [19] collected data from three written deception tasks. Zhou and Sung [32] collected 1192 Mafia games from a popular Chinese website. de Ruiter and Kachergis [27] proposed the Mafiascum dataset, a collection of over 700 games of Mafia.

In addition to text-based datasets, some studies have developed audio-based ones. Hirschberg et al. [9] were the first to propose an audio-based corpus, consisting of 32 interviews averaging 30 minutes each. Levitan et al. [11] collected a much larger corpus. However, these two datasets are not publicly and freely available. Furthermore, it is hard to model contextual semantics based only on the audio modality.

The multimodal datasets were all collected from public multimedia sources, such as public court trials [24], street interviews aired on television shows [25], and the Box of Lies game on a TV show [28]. These data cannot be annotated by the people who expressed the deception or non-deception; the researchers labeled the data themselves after collection, which may introduce human bias. Existing public multimedia sources also cannot provide enough labeled samples for deep learning-based deception detection methods. Moreover, compared with text data, processing multimodal data requires more computing resources.

Fig. 1. Overview of the CSN. The red dashed box in the sequence indicates the target QA pair to be predicted. The black arrow in the context selector denotes the mask matrix obtained after the cosine similarity module, and blue arrows pointing to the mask matrix denote its inputs. Dotted arrows from the mask matrix mean the mask value is 0 and the corresponding contexts are masked, while solid arrows mean the mask value is 1. (Color figure online)

3 Model

3.1 Problem Formalization

Suppose that we have a dataset \(D={\{U_i,Y_i}\}^N_{i=1}\), where \(U_i={\{q_{il}, a_{il}}\}^L_{l=1}\) represents a multi-turn QA with L QA pairs and every sentence in a multi-turn QA contains T words. N is the number of multi-turn QAs in the dataset. \(Y_i={\{y_{il}}\}^L_{l=1}\) where \(y_{il}\in {\{0, 1}\}\) denotes the label of a QA pair. \(y_{il}=1\) means \({\{q_{il}, a_{il}\}}\) is deceptive, otherwise \(y_{il}=0\). Given the dataset, the goal of deception detection is to learn a classifier \(f: U\rightarrow Y\), where U and Y are the sets of QA pairs and labels respectively, to predict the label of QA pairs based on the context information in a multi-turn QA.
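Concretely, the data layout in this formalization can be sketched as follows (the class and function names are ours, for illustration only, not from the paper):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QAPair:
    question: str   # q_il
    answer: str     # a_il
    label: int      # y_il: 1 = deceptive, 0 = truthful

# One multi-turn QA U_i is an ordered list of L QA pairs;
# the dataset D is then a list of N such sequences.
MultiTurnQA = List[QAPair]

def labels(dialogue: MultiTurnQA) -> List[int]:
    """Extract the label sequence Y_i from one multi-turn QA."""
    return [pair.label for pair in dialogue]

# Toy example with L = 2 QA pairs
dialogue = [
    QAPair("Do you like sports?", "Yes, I play billiards.", 0),
    QAPair("How often do you play?", "Every day for ten years.", 1),
]
print(labels(dialogue))  # [0, 1]
```

The classifier \(f\) then maps each QA pair, together with its surrounding context in the sequence, to one of the two labels.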

3.2 Model Overview

We propose the CSN, which first generates context-independent sentence embeddings, then selects contexts for the target question and answer respectively to filter out noise, and finally uses the context encoder to obtain context-aware sentence embeddings. As illustrated in Fig. 1, the proposed model consists of a Word Encoder, a Context Selector, a Context Encoder, and a Question Answer Pair Classifier.

3.3 Word Encoder

Since the data are collected by designing questions first and then gathering the corresponding answers, we treat a multi-turn QA as the combination of one question sequence and one answer sequence. The \(l\)-th question and answer with T words in the \(i\)-th multi-turn QA are defined as \({\{w^Q_{l1}, ..., w^Q_{lT}}\}\) and \({\{w^A_{l1}, ..., w^A_{lT}}\}\) respectively. We feed both sentences into the pre-trained BERT to obtain context-independent sentence embeddings, defined as \(g^Q_{l}\) and \(g^A_{l}\) for the question and answer respectively. In the experiments, we also replace BERT with a BiGRU, which confirms the effectiveness of BERT.

3.4 Context Selector

Given a multi-turn QA and its sentence representations, we treat the questions and answers as two contexts: \(\mathcal {Q}={\{g^Q_{l}}\}^L_{l=1}, \mathcal {A}={\{g^A_{l}}\}^L_{l=1}\). We design a context selector that selects contexts for the target question and answer respectively, in order to eliminate the influence of noise in the context.

We treat the answer of the QA pair to be predicted, \(g^A_{l}\), as the key to select the corresponding answer contexts. We use cosine similarity to measure the text similarity between the answer key \(g^A_{l}\) and the answer context \(\mathcal {A}\), formulated as:

$$\begin{aligned} s_{A_{l}} = \frac{\mathcal {A} g^{A\top }_{l}}{||\mathcal {A}||_2 ||g^A_{l}||_2}, \end{aligned}$$
(1)

where \(s_{A_{l}}\) is the relevance score.

We then use the score to form a mask matrix for each answer and assign the same mask matrix to the question contexts, aiming to keep the masked answer sequence and question sequence consistent:

$$\begin{aligned} \tilde{s}_{A_{l}} = (\sigma (s_{A_{l}}) \ge \gamma ), \tilde{s}_{Q_{l}} = \tilde{s}_{A_{l}}, \end{aligned}$$
(2)
$$\begin{aligned} Q_{l} = \tilde{s}_{Q_{l}} \odot \mathcal {Q}, A_{l} = \tilde{s}_{A_{l}} \odot \mathcal {A}, \end{aligned}$$
(3)

where \(\odot \) denotes element-wise multiplication, \(\sigma \) is the sigmoid function, and \(\gamma \) is a threshold tuned on the dataset. Sentences whose scores fall below \(\gamma \) are filtered out. \(Q_{l}\) and \(A_{l}\) are the final contexts for \(q_{l}\) and \(a_{l}\).

The context selector makes the model focus on the more relevant contexts by filtering out noisy ones, and thus helps the model explore context-sensitive dependencies implied in the multi-turn QA.
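As an illustration, the selection step of Eqs. (1)-(3) can be sketched in NumPy as follows (the function name, the small epsilon for numerical safety, and the broadcasting details are our assumptions, not the paper's implementation):

```python
import numpy as np

def context_select(ctx_A, ctx_Q, l, gamma=0.63):
    """Select contexts for the l-th QA pair (Eqs. 1-3).

    ctx_A, ctx_Q: (L, d) arrays of answer / question sentence embeddings.
    Returns the masked question context, masked answer context, and the mask.
    """
    key = ctx_A[l]                                   # answer key g^A_l
    # Eq. (1): row-wise cosine similarity between the key and all answers
    scores = ctx_A @ key / (
        np.linalg.norm(ctx_A, axis=1) * np.linalg.norm(key) + 1e-8)
    # Eq. (2): sigmoid + threshold -> binary mask, shared by both contexts
    mask = (1.0 / (1.0 + np.exp(-scores)) >= gamma).astype(float)
    # Eq. (3): broadcast the mask over the embedding dimension
    return mask[:, None] * ctx_Q, mask[:, None] * ctx_A, mask
```

Note that since the cosine score of the key with itself is 1 and \(\sigma(1)\approx 0.73 > 0.63\), the target answer itself is always retained, while orthogonal context sentences (\(\sigma(0)=0.5\)) are masked out at the default threshold.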

3.5 Context Encoder

Given the selected contexts of the target question and answer, we feed them to two BiGRUs respectively:

$$\begin{aligned} \tilde{Q}_{l} = \overleftrightarrow {GRU_Q}(Q_{l\pm 1}, g^Q_{l}), \end{aligned}$$
(4)
$$\begin{aligned} \tilde{A}_{l} = \overleftrightarrow {GRU_A}(A_{l\pm 1}, g^A_{l}), \end{aligned}$$
(5)

where \(\tilde{Q}_{l}\) and \(\tilde{A}_{l}\) are the outputs at the positions of \(q_{l}\) and \(a_{l}\) in the two bidirectional GRUs, i.e., the context-aware embeddings of \(q_{l}\) and \(a_{l}\) respectively.

We use the two context encoders to model context dependencies among the answers and among the questions respectively. In this way, we can make full use of the deceptive signals implied in the contexts to recognize deceptive QA pairs.
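To make the encoding step concrete, the following is a minimal NumPy sketch of a bidirectional GRU pass over a selected context sequence. The random initialization, the omission of bias terms, and the gate convention are simplifications of ours, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinimalBiGRU:
    """Tiny bidirectional GRU encoder (forward pass only, no training)."""

    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        # One parameter set per direction, each mapping input/hidden to 3 gates
        self.p = {d: {"Wx": rng.normal(0, 0.1, (3 * d_hid, d_in)),
                      "Wh": rng.normal(0, 0.1, (3 * d_hid, d_hid))}
                  for d in ("fwd", "bwd")}
        self.d = d_hid

    def _step(self, p, x, h):
        d = self.d
        gx, gh = p["Wx"] @ x, p["Wh"] @ h
        z = sigmoid(gx[:d] + gh[:d])               # update gate
        r = sigmoid(gx[d:2*d] + gh[d:2*d])         # reset gate
        n = np.tanh(gx[2*d:] + r * gh[2*d:])       # candidate state
        return (1 - z) * h + z * n

    def __call__(self, seq):
        """seq: list of (d_in,) vectors -> list of (2*d_hid,) outputs."""
        fwd, h = [], np.zeros(self.d)
        for x in seq:                              # left-to-right pass
            h = self._step(self.p["fwd"], x, h)
            fwd.append(h)
        bwd, h = [], np.zeros(self.d)
        for x in reversed(seq):                    # right-to-left pass
            h = self._step(self.p["bwd"], x, h)
            bwd.append(h)
        bwd.reverse()
        # Concatenate both directions to get context-aware embeddings
        return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

In the CSN, two such encoders run over the masked question and answer sequences respectively, and \(\tilde{Q}_l\) and \(\tilde{A}_l\) are the outputs at position \(l\).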

3.6 Question Answer Pair Classifier

The context-aware embeddings of the target question and answer are then concatenated to obtain the final QA pair representation:

$$\begin{aligned} h_{l} = [\tilde{Q}_{l}, \tilde{A}_{l}]. \end{aligned}$$
(6)

Finally, the representation of the QA pair is fed into a softmax classifier:

$$\begin{aligned} z_{l} = softmax(W h_{l} + b ), \end{aligned}$$
(7)

where W and b are trainable parameters.

The loss function is defined as the cross-entropy error over all labeled QA pairs:

$$\begin{aligned} \mathcal {L} = - \sum ^{N}_{i=1}\sum ^{L}_{l=1}y_{il} \ln z_{il}, \end{aligned}$$
(8)

where N is the number of multi-turn QAs; L is the number of QA pairs in a multi-turn QA and \(y_{il}\) is the ground-truth label of the QA pair.
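A minimal sketch of the classifier and the per-pair loss in Eqs. (7)-(8), with illustrative shapes and names of our choosing:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def classify(h, W, b):
    """Eq. (7): 2-way softmax over the concatenated QA representation h."""
    return softmax(W @ h + b)

def cross_entropy(z, y):
    """Eq. (8) for a single QA pair: y is the one-hot ground-truth label."""
    return -np.sum(y * np.log(z + 1e-12))
```

The total training loss sums this term over all L QA pairs of all N multi-turn QAs, and W and b are learned jointly with the encoders.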

4 Deception QA Dataset Design

Our goal is to build a Chinese text-based collection of deceptive and non-deceptive data in the form of multi-turn QA, which allows us to analyze contextual dependencies between QA pairs with respect to deception. We design questionnaires on different daily-life topics and then recruit subjects to answer the questions.

4.1 Questionnaires Design

To collect deceptive and non-deceptive data, we design six questionnaires covering six daily-life topics: sports, music, tourism, film and television, school, and occupation. The number of questions per questionnaire varies from seven to ten. Specifically, the first question of each questionnaire is directly related to the corresponding theme, as shown in Table 1, and the following questions are designed so that they can be viewed as follow-ups to the first question. There are also progressive dependencies between these questions.

4.2 Answers Collection

To obtain deceptive and non-deceptive data, we recruit 318 subjects from universities and companies to fill in the six questionnaires. The numbers of collected multi-turn QAs for the six themes are 337, 97, 49, 53, 51, and 49 respectively.

Each subject is asked to answer the same questionnaire twice to keep the distribution of deceptive and non-deceptive data as balanced as possible. The first time, subjects must tell the truth in answer to the first question; the second time, they must lie to the same first question. Subjects may answer the following questions truthfully or deceptively as they wish, but their overall goal is to convince others that all of their answers are true. Because the questions in a questionnaire are sequentially dependent, forcing subjects to change their answer to the first question, rather than to other questions, helps them better organize their expression when answering the following questions. To motivate subjects to produce high-quality deceptive and non-deceptive answers, we give them monetary rewards.

Similar to previous work [11], we ask the subjects to label their own answers with “T” or “F”, where “T” means the answer is truthful and “F” means it is deceptive.

Table 2. Statistics of train, dev and test sets of Deception QA.

4.3 Train/Dev/Test Split

We finally obtain 636 multi-turn QAs and 6113 QA pairs. After shuffling all of the multi-turn QAs, we randomly divide the data into train, development, and test sets in a ratio of 8:1:1. Table 2 shows the dataset statistics.
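The split described above can be sketched as follows; the random seed and the rounding behavior are our assumptions, and the paper's exact split sizes are those reported in Table 2:

```python
import random

def split_811(dialogues, seed=42):
    """Shuffle multi-turn QAs and split them 8:1:1 into train/dev/test.

    Splitting at the dialogue level (not the QA-pair level) keeps every
    multi-turn QA intact within a single set.
    """
    data = list(dialogues)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])

train, dev, test = split_811(range(636))
print(len(train), len(dev), len(test))  # 508 63 65
```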

5 Experiments

5.1 Experimental Settings

The Deception QA dataset is in Chinese. Jieba is employed to segment text into Chinese words, and GloVe [23] is employed to obtain pre-trained word embeddings. Moreover, we use Chinese BERT and RoBERTa with whole word masking [4]. For the context selector, \(\gamma \) is set to 0.63 according to the validation data. Performance is evaluated using standard Macro-Precision, Macro-Recall, and Macro-F1.
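For completeness, the macro-averaged metrics used here can be computed without external libraries as in this sketch (the function name is ours):

```python
def macro_prf(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged precision, recall, and F1 over the given classes."""
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    k = len(classes)
    # Macro averaging weights both classes equally, which matters here
    # because deceptive and truthful QA pairs may be imbalanced.
    return sum(ps) / k, sum(rs) / k, sum(fs) / k
```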

5.2 Baselines

The baselines are divided into two groups according to whether they take the context into consideration. Without the context, we compare our model with general text classification approaches: BiGRU [3], TextCNN [10], BERT [5], and RoBERTa [16]. With the context, we use BiGRU-CC, attBiGRU-CC, TextCNN-BiGRU, and DialogueGCN [8], where CC means considering all the contexts and DialogueGCN is the state-of-the-art model for emotion recognition in conversation. Our CSN and CSN-BERT/-RoBERTa have a carefully designed context selector that filters noise out of the context.

Table 3. Comparison of varying approaches.
Table 4. Ablation study on deception QA dataset.

5.3 Results and Analysis

The results in Table 3 fall into three parts. From top to bottom, they correspond to models that ignore the contexts, consider all the contexts, and perform context selection.

From the first part, we find that methods based on pre-trained language models (PLMs) are generally better than general text classification models. From the second part, we find that approaches considering the contexts perform much better than those that do not. This confirms the usefulness of QA context for detecting deception.

Our proposed model achieves the best performance among all of the strong baselines. The Macro-F1 score of CSN-RoBERTa is 5.65% higher than that of RoBERTa and 6.61% higher than that of DialogueGCN. Compared with the other sequence-based approaches without the context selector, CSN-RoBERTa’s Macro-F1 is 11.26% higher on average. This indicates that taking all of the contexts into account, noise included, can hurt model performance: besides context information, noise is another key factor affecting the model’s ability to recognize deception. The results demonstrate the effectiveness of our model.

From the experimental results in Table 4, we observe that removing the context selector degrades performance. Across the three models in the ablation study, the Macro-F1 of the models with the context selector is 3.02% higher on average than that of the models without it. This proves that the proposed context selector does help improve the model’s ability to recognize deceptive and non-deceptive QA pairs in a multi-turn QA.

Table 5. An example that CSN-RoBERTa successfully identified Turn-5 as deception but RoBERTa-BiGRU-CC failed. CSN-RoBERTa chose to mask the QA pairs written in blue in order to predict the label of Turn-5.

5.4 Case Study

Table 5 shows an example in which CSN-RoBERTa successfully predicted Turn-5 as deception by masking Turn-2, Turn-8, and Turn-9, while RoBERTa-BiGRU-CC, which takes all of the contexts into consideration, misclassified Turn-5.

In this example, the masked contexts can be regarded as noise that is less relevant to Turn-5. Turn-2 is about when the subject began to like billiards, which is relatively irrelevant to the subject’s experience in the game. Turn-7, Turn-8, and Turn-9 all concern star players and provide no effective information for judging whether Turn-5 is deceptive. Due to the imprecision of the model, only Turn-2, Turn-8, and Turn-9 were masked; such noisy context can confuse a model that considers all of it and prevent it from classifying Turn-5 correctly.

6 Conclusion

In this paper, we propose a novel task, deception detection in multi-turn QA, together with a Context Selection Network that models context-sensitive dependencies. In addition, we build a high-quality dataset for the experiments. Empirical evaluation on the collected dataset indicates that our approach significantly outperforms several strong baselines, showing that the QA contexts and the context selector help the model effectively explore deceptive features. In the future, we would like to integrate user information to explore deeper deceptive signals in multi-turn QA.