Neurocomputing

Volume 524, 1 March 2023, Pages 167-177

HiBERT: Detecting the illogical patterns with hierarchical BERT for multi-turn dialogue reasoning

https://doi.org/10.1016/j.neucom.2022.12.038

Abstract

Dialogue reasoning is a new task beyond the traditional dialogue system, because it requires recognizing not only the semantic relevance but also the logical consistency between a candidate response and the dialogue history. Much as "happy families are all alike; every unhappy family is unhappy in its own way", various illogical patterns exist in the data. For example, some candidate responses use many words similar to the history but contradict its meaning, while other candidates employ totally different words yet convey consistent meanings. Therefore, an ideal dialogue reasoning model should gather clues at both the coarse-grained utterance level and the fine-grained word level to determine the logical relation between candidates and the dialogue history. However, traditional models mainly rely on the widely used BERT to read all the history and candidates word by word and ignore utterance-level signals, so they cannot capture the various illogical patterns in this task well. To tackle this problem, we propose a novel Hierarchical BERT (HiBERT) to recognize both utterance-level and word-level illogical patterns. Specifically, BERT is first utilized to encode the dialogue history and each candidate response into contextualized representations. Secondly, a hierarchical reasoning architecture is applied to these contextualized representations to obtain the word-level and the utterance-level attention distributions, respectively. In detail, we utilize a word-grained attention mechanism to obtain the word-level representation, and propose two different types of attention function, i.e., hard attention and soft attention, to obtain the utterance-grained representation. Finally, we fuse the word-grained representation and the utterance-grained representation to calculate the logical ranking score for a given candidate. Experimental results on two public dialogue datasets show that our model obtains higher ranking measures than the widely used BERT model, validating the effectiveness of the hierarchical reading in HiBERT. Further analysis of the impact of context length and of attention weights shows that HiBERT indeed has the ability to recognize different illogical patterns.

Introduction

In the traditional dialogue task, most existing models [1], [2] focus only on the semantic relevance between the candidate response and the dialogue history, but neglect the logical consistency between them. As a result, those models easily produce illogical responses, such as replying "My wife said something about number" when the dialogue history states "My wife didn't say anything about number", as shown in Table 1, which greatly hurts the user experience. To facilitate conversation reasoning research, Cui et al. [3] propose a new multi-turn dialogue reasoning task and release a reasoning dialogue dataset, named MuTual, based on Chinese high school English listening comprehension tests. Given the dialogue history, the question, and its similar options, annotators are required to rewrite the question and options into candidate responses. Therefore, all the candidates are relevant and look-alike but differ slightly from the dialogue history, which makes the dataset extremely suitable for the multi-turn dialogue reasoning task.

This reasoning task is challenging because various illogical patterns exist in the data. For example, some candidate responses use many similar words but conflict with the dialogue history. Take option 1 and context 4 in Table 1 as an example: option 1 says "You mean you cannot look it up even though I remember what day she brought it in?" and context 4 is "I can look it up if you remember what day she brought it in". The two look extremely alike but differ in a single word, i.e., "not". Another illogical pattern is that some candidate responses use totally different words but convey consistent meanings. Take option 4 and context 3 in Table 1 as an example: option 4 is "This dress is very important for her to take part in the party this evening." and context 3 is "She needs that dress for a party we are attending tonight.". The option and the context look extremely different but convey consistent information. Therefore, an ideal dialogue reasoning model should gather logical clues at both the coarse-grained utterance level and the fine-grained word level to determine the logical relation between candidates and the dialogue history.

However, traditional models mainly rely on the widely used BERT model to read all the context and candidates carefully at the word granularity and ignore utterance-level semantic information, which makes it difficult to recognize the various illogical patterns in the dialogue reasoning task. For the similar-words-but-contradictory pattern, if we read all the context and options carefully at the word granularity, as in classic BERT, unrelated noise is likely to be introduced into the model, and the effectiveness of the logical judgment is hurt significantly. For example, in Table 1, option 1 and context 4 use many similar words but differ in a single word, i.e., "not". Coincidentally, this word also appears in the nearby context 2 "not by name", which acts as unrelated noise. For the different-words-but-consistent pattern, if we read all the context and options word by word, the utterance-level semantic information is likely to be lost, which hinders building the consistent relationship between the logical response and its different-looking dialogue history. Since the logical clues usually depend on some option-relevant contexts rather than the whole context, it is critical to introduce utterance-level semantic information and then focus on the option-related contexts.

To alleviate this problem, we propose a novel Hierarchical BERT (HiBERT) to recognize both utterance-level and word-level illogical patterns. Specifically, we first adopt BERT to obtain a contextual representation for each word in the context and the option. We then apply a hierarchical reasoning architecture on top of these contextual representations, consisting of word-level and utterance-level reasoning modules. For the word-level reasoning, we utilize a word-grained attention mechanism to obtain the word-level representation. For the utterance-level reasoning, we propose two different types of attention function: 1) hard attention, which focuses only on the option-related utterances and discards the others, with the utterance selection module learned from pseudo-annotated data; 2) soft attention, which considers the whole context through an attention distribution and pays more attention to the option-related utterances. We propose these two functions because they capture the correlation between each utterance and the option from multiple perspectives [4], providing richer correlation information. Finally, we fuse the word-level representation and the utterance-level reasoning representation to predict the logical score for the given candidate.
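
A minimal PyTorch sketch of this hierarchical reasoning step is given below. It is an illustrative sketch rather than the authors' implementation: the module names, the learned query projection used to score utterances, and the hidden size are assumptions, and the hard variant here simply keeps the single top-scoring utterance instead of the pseudo-annotation-trained selection module described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalReasoner(nn.Module):
    """Illustrative word-level + utterance-level reasoning with fusion (hypothetical module)."""

    def __init__(self, hidden: int = 768, hard: bool = False):
        super().__init__()
        self.hard = hard
        self.word_att = nn.Linear(hidden, 1)        # word-grained attention scorer
        self.utt_query = nn.Linear(hidden, hidden)  # projects the option into an utterance query
        self.scorer = nn.Linear(2 * hidden, 1)      # fuses the two views into a logical score

    def forward(self, word_reprs, utt_reprs, option_repr):
        # word_reprs:  (batch, num_words, hidden) contextualized word vectors from BERT
        # utt_reprs:   (batch, num_utts, hidden)  pooled utterance vectors
        # option_repr: (batch, hidden)            pooled candidate-option vector

        # word-level reasoning: attention over all context words
        word_weights = F.softmax(self.word_att(word_reprs).squeeze(-1), dim=-1)
        word_view = torch.bmm(word_weights.unsqueeze(1), word_reprs).squeeze(1)

        # utterance-level reasoning: relevance of each utterance to the option
        query = self.utt_query(option_repr)                                  # (batch, hidden)
        utt_scores = torch.bmm(utt_reprs, query.unsqueeze(-1)).squeeze(-1)   # (batch, num_utts)
        if self.hard:
            # hard attention: keep only the most option-related utterance
            idx = utt_scores.argmax(dim=-1)
            utt_view = utt_reprs[torch.arange(utt_reprs.size(0)), idx]
        else:
            # soft attention: weighted sum over all utterances
            utt_weights = F.softmax(utt_scores, dim=-1)
            utt_view = torch.bmm(utt_weights.unsqueeze(1), utt_reprs).squeeze(1)

        # fusion: concatenate both views and predict the logical ranking score
        return self.scorer(torch.cat([word_view, utt_view], dim=-1)).squeeze(-1)

# toy forward pass: 2 dialogues, 20 context words, 5 utterances, hidden size 768
model = HierarchicalReasoner(hard=False)
scores = model(torch.randn(2, 20, 768), torch.randn(2, 5, 768), torch.randn(2, 768))
print(scores.shape)  # torch.Size([2])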

In our experiments, we use two public multi-turn dialogue datasets, MuTual and Ubuntu, to evaluate the proposed models. The results show that HiBERT ranks the candidate responses more accurately than the widely used BERT model; in particular, on the MuTual dev set, HiBERT obtains a 3.0% improvement in R4@1. Further analysis of the impact of context length reveals a large gap between HiBERT and BERT on longer contexts, especially those longer than 150. In addition, the attention distribution map shows that the option-related contexts selected by HiBERT are highly coherent with human understanding, validating the interpretability and correctness of HiBERT.

The main contributions of this paper include:

  • We introduce the hierarchical architecture to enhance the reasoning ability of BERT for the dialogue reasoning task.

  • We propose two types of attention functions for utterance-grained reading: hard attention and soft attention.

  • We evaluate HiBERT on two public multi-turn dialogue datasets and conduct rigorous experiments to demonstrate the effectiveness of our proposed models.

Section snippets

Related work

Increasing attention has been paid to multi-turn dialogue in both academia [6], [7] and industry [5]. One reason is that it is closer to real-world scenarios, such as chatbots and virtual assistants. More importantly, multi-turn dialogue needs to consider more information and constraints [8], [9], [10], [11], which makes research in this field more challenging.

Previous dialogue models [12], [11], [13], [14], [15], [16], [17] mostly focus on RNN-based encoder-decoder forms [18], which

Model

In this section, we first describe the task definition and then introduce our proposed model in detail; the architecture is shown in Fig. 1. Our proposed model consists of three components, i.e., contextual encoding, hierarchical reasoning, and response prediction. Given the dialogue context and one of its candidate options, we first utilize the pre-trained language model BERT to encode each token in the context and the option into a fixed-length vector that carries contextual information.
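
As a concrete illustration of this encoding step, the sketch below uses the Hugging Face transformers library to obtain one contextual vector per token; the bert-base-uncased checkpoint and the packing of context and option into a single sentence pair are assumptions for illustration, not necessarily the exact setup used in the paper.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

context = "My wife didn't say anything about number. I can look it up if you remember what day she brought it in."
option = "You mean you cannot look it up even though I remember what day she brought it in?"

# context and option are packed as a sentence pair: [CLS] context [SEP] option [SEP]
inputs = tokenizer(context, option, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = encoder(**inputs)

# one fixed-length contextual vector per token; these feed the hierarchical reasoning layers
token_vectors = outputs.last_hidden_state  # shape: (1, sequence_length, 768)
print(token_vectors.shape)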

Experiments

In this section, we conduct experiments on two public dialogue datasets to evaluate our proposed method. We first introduce the experimental settings, then present the metric-based and human-based evaluation results to verify the effectiveness of our proposed model, including the comparison with baselines and the ablation results. Finally, we analyze the context length and attention weights to show that HiBERT actually has the ability to recognize different illogical patterns.
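
For reference, the ranking measures used on MuTual-style data can be computed as in the short sketch below; it assumes Rn@k-style recall and mean reciprocal rank over the four candidates, which is a common evaluation protocol for this task, rather than reproducing the paper's exact evaluation script.

from typing import List

def recall_at_k(scores: List[float], gold_index: int, k: int) -> float:
    """Returns 1.0 if the gold candidate is ranked within the top k, else 0.0."""
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if gold_index in ranking[:k] else 0.0

def mean_reciprocal_rank(scores: List[float], gold_index: int) -> float:
    """Returns the reciprocal rank of the gold candidate for a single example."""
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 / (ranking.index(gold_index) + 1)

# toy example: model scores for 4 candidate responses, gold candidate at index 2
scores = [0.10, 0.40, 0.35, 0.20]
print(recall_at_k(scores, gold_index=2, k=1))       # 0.0, i.e., R4@1 misses (ranked 2nd)
print(recall_at_k(scores, gold_index=2, k=2))       # 1.0, i.e., R4@2 hits
print(mean_reciprocal_rank(scores, gold_index=2))   # 0.5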

Conclusion

In this paper, we focus on the multi-turn dialogue reasoning task and propose HiBERT. The motivation comes from the fact that the widely used BERT reads all the history and candidates word by word but ignores utterance-level signals, resulting in poor performance on the following patterns: 1) some candidate responses use many similar words but contradict the history; 2) some candidates employ totally different words but convey consistent meanings. To address

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by National Key R&D Program of China (Grant No.2022YFB3103703), Guangxi Key Laboratory of Cryptography and Information Security (No.GCIS202111), National Natural Science Foundation of China (Grant No.U21A20468, 52071312, 61972043, 61921003), The Open Program of Zhejiang Lab (Grant No.2021PD0AB02), The Fundamental Research Funds for the Central Universities (Grant No.2020XD-A07-1), and Hebei Higher Education Teaching Reform Research and Practice Project (Grant


References (40)

  • A. Sordoni et al., A hierarchical recurrent encoder-decoder for generative context-aware query suggestion
  • C. Xing, Y. Wu, W. Wu, Y. Huang, M. Zhou, Hierarchical recurrent attention network for response generation, in:...
  • L. Cui, Y. Wu, S. Liu, Y. Zhang, M. Zhou, MuTual: A dataset for multi-turn dialogue reasoning, ACL,...
  • K. Xu, J. Ba, R. Kiros, K. Cho, A.C. Courville, R. Salakhutdinov, R.S. Zemel, Y. Bengio, Show, attend and tell: Neural...
  • S. Wu et al., Diverse and informative dialogue generation with context-specific commonsense knowledge awareness, ACL (2020)
  • H. Cho, J. May, Grounding conversations with improvised dialogues, in: D. Jurafsky, J. Chai, N. Schluter, J.R....
  • M. Roddy, N. Harte, Neural generation of dialogue response timings, in: D. Jurafsky, J. Chai, N. Schluter, J.R....
  • W. Zhang et al., Context-sensitive generation of open-domain conversational responses, COLING (2018)
  • H. Chen, Z. Ren, J. Tang, Y.E. Zhao, D. Yin, Hierarchical variational memory network for dialogue generation, WWW,...
  • H. Zhang, Y. Lan, L. Pang, J. Guo, X. Cheng, ReCoSa: Detecting the relevant contexts with self-attention for multi-turn...
  • Y. Wu et al., Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots, ACL (2017)
  • I.V. Serban, A. Sordoni, Y. Bengio, A. Courville, J. Pineau, Building end-to-end dialogue systems using generative...
  • H. Zeng, J. Liu, M. Wang, B. Wei, A sequence to sequence model for dialogue generation with gated mixture of topics,...
  • T. Young, V. Pandelea, S. Poria, E. Cambria, Dialogue systems with audio context,...
  • V. Tran, L. Nguyen, Gating mechanism based natural language generation for spoken dialogue systems,...
  • D. Griol, A. Sanchis, J.M. Molina, Z. Callejas, Developing enhanced conversational agents for social virtual worlds,...
  • D. Ren, Y. Cai, X. Lei, J. Xu, Q. Li, H. Leung, A multi-encoder neural conversation model,...
  • S. Santhanam, S. Shaikh, A survey of natural language generation techniques with a focus on dialogue systems – past,...
  • H. Chen, X. Liu, D. Yin, J. Tang, A survey on dialogue systems: Recent advances and new frontiers,...
  • C. Xu et al., Neural response generation with meta-words, ACL (2019)

    Xu Wang received the Ph.D. degree from the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing. He is currently an Assistant Professor in the School of Artificial Intelligence at Hebei University of Technology. His current research interests include knowledge graphs, dialogue, and question answering.

    Hainan Zhang received her Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, in 2019. She now works as a research scientist at the Data Science Lab, JD.com. Her main research directions are dialogue generation and content generation in e-commerce scenarios. In the past three years, she has published 6 papers as first author and 5 papers as corresponding author at top-tier conferences such as ACL, EMNLP, SIGIR, AAAI, and IJCAI.

    Shuai Zhao (M’ 19) received the Ph.D. degree in computer science and technology from the Beijing University of Posts and Telecommunications (supervisor: Prof. Junliang Chen) in June 2014. He is an Associate Professor with the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications, and also with Beijing Advanced Innovation Center for Future Internet Technology. His current research interests include Internet of Things technology and service computing.

    Hongshen Chen is the NLP lead for the recommendation platform at JD.com. He received his Ph.D. from the Institute of Computing Technology, Chinese Academy of Sciences, in 2017. He currently focuses on natural language processing, dialogue systems, sequence labeling, etc. He won the best paper award at CCL 2016 and has published over 30 papers at top-tier AI conferences including ACL, EMNLP, COLING, AAAI, IJCAI, WWW, and CIKM.

    Bo Cheng (M’ 12) received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2006. He is a Professor with the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications, and also with the Beijing Advanced Innovation Center for Future Internet Technology, Beijing, China. His research interests include network services and intelligence, Internet of Things technology, communication software, and distributed computing. He is a member of the IEEE.

    Zhuoye Ding is the NLP lead for the recommendation platform at JD.com. He received his Ph.D. from Fudan University in 2013. He currently focuses on natural language processing, dialogue systems, etc. He has published over 20 papers at top-tier AI conferences including KDD, EMNLP, AAAI, IJCAI, and CIKM.

    Sulong Xu is Head of the Search and Recommendation Business Unit. He received his M.S. degree from Southeast University in 2011. He currently focuses on natural language processing, recommendation systems, etc.

    Weipeng Yan is VP of JD Group. He received his Ph.D. degree from the University of Waterloo. He currently focuses on artificial intelligence, natural language processing, recommendation systems, etc.

    Yanyan Lan is currently a Professor with the Institute for AI Industry Research, Tsinghua University. Her current research interests include machine learning, web search and data mining, and big data analysis. She has published more than 60 papers at top conferences, including ICML, NIPS, SIGIR, and WWW, and her article "Top-k Learning to Rank: Labeling, Ranking, and Evaluation" received the Best Student Paper Award at SIGIR 2012. She received the Best Paper Runner-Up Award at CIKM in 2017, was named an Outstanding Reviewer of SIGIR in 2017, and is a member of the Youth Innovation Promotion Association, CAS.
