
A two-domain coordinated sentence similarity scheme for question-answering robots regarding unpredictable outliers and non-orthogonal categories

Published in: Applied Intelligence

Abstract

It is crucial and challenging for a question-answering robot (Qabot) to match customer-input questions with the a priori identification questions, because expressions are highly diversified, especially in Chinese. This article proposes a coordinated scheme that analyzes sentence similarity in two independent domains instead of relying on a single deep learning model. In the structure domain, BLEU and data preprocessing are applied in a binary analysis that discriminates unpredictable outliers (illegal questions) from the existing library. In the semantics domain, the MC-BERT model, which integrates a BERT encoder with a multi-kernel convolutional top classifier, is developed to handle the non-orthogonality of the identification question categories. The two-domain analyses run in parallel, and the two similarity scores are coordinated for the final response. The linguistic features of Chinese are also taken into account. A realistic Qabot case on energy trading service and finance is studied numerically. Computational results validate the effectiveness and accuracy of the proposed algorithm: the Top-1 and Top-3 accuracies are 90.5% and 95.5%, respectively, which are significantly superior to the latest published results.
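To make the coordination concrete, the following is a minimal sketch of the workflow described above, assuming a token-level BLEU for the structure domain, cosine similarity over sentence embeddings as a stand-in for the MC-BERT semantic score, and a simple weighted combination. The function names, rejection rule, and blending weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def structural_scores(query_tokens, library_tokens):
    """Structure domain: BLEU-based similarity of one tokenized query
    against every tokenized identification question."""
    smooth = SmoothingFunction().method1
    return np.array([
        sentence_bleu([ref], query_tokens, smoothing_function=smooth)
        for ref in library_tokens
    ])

def semantic_scores(query_vec, library_vecs):
    """Semantics domain: cosine similarity between sentence embeddings,
    used here only as a stand-in for the MC-BERT similarity scores."""
    norms = np.linalg.norm(library_vecs, axis=1) * np.linalg.norm(query_vec)
    return library_vecs @ query_vec / np.maximum(norms, 1e-12)

def coordinate(u, v, theta=0.1, r=-1.0, alpha=0.5):
    """Coordinate the two score vectors: if every structural score is below
    the threshold theta, treat the query as an unpredictable outlier and
    suppress it with the negative enhancement r; otherwise blend the two
    domains with a simple weighted sum (an illustrative choice)."""
    if u.max() < theta:
        return np.full_like(u, r), None          # rejected: no matching question
    s = alpha * u + (1.0 - alpha) * v            # integrated similarity vector s
    return s, int(np.argmax(s))                  # Top-1 identification question index
```

In this sketch the outlier decision and the semantic ranking are composed sequentially for brevity; in the proposed scheme the two domain analyses run in parallel before their scores are coordinated.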



Notes

  1. Joint Laboratory of HIT and iFLYTEK Research

  2. https://github.com/hankcs/HanLP/tree/master

  3. https://github.com/hankcs/pyhanlp

  4. https://github.com/ymcui/Chinese-BERT-wwm/blob/master/README_EN.md

  5. https://github.com/ymcui/MacBERT

  6. https://github.com/ymcui/Chinese-XLNet

  7. https://github.com/ymcui/Chinese-ELECTRA/blob/master/README_EN.md

  8. https://github.com/brightmart/roberta_zh

Abbreviations

Qabot:

question-answering robots

CNN:

convolutional neural network

BLEU:

bilingual evaluation understudy

BERT:

bidirectional encoder representations from transformers

m :

total number of customer-input questions

n :

total number of identification questions

\(\hat {q}_j\) :

j-th customer-input question

\(\hat {\mathbf {Q}}\) :

set of customer-input questions, \(\hat {\mathbf {Q}}={\{{\hat {q}}_j\}}_{j=1}^{m}\)

\(q_i^{\ast}\):

i-th identification question

\(\mathbf{Q}^{\ast}\):

set of identification questions, \(\mathbf {Q}^{\ast }=\left \{q_i^{\ast }\right \}_{i=1}^n\)

\(\mathbf{u}\):

structural similarity score vector of a customer-input question \(\hat{q}\) with respect to all identification questions, \(\mathbf{u}\in\mathbb{R}^n\), and \(\mathbf{u}=(u_1,\cdots,u_n)\)

\(\mathbf{r}\):

negative enhancement score vector for the structural similarity analysis, \(\mathbf{r}\in\mathbb{R}^n\); each element of this vector is the same negative value \(r\), i.e. \(\mathbf{r}=(r,\cdots,r)\)

\(\mathbf{v}\):

semantic similarity score vector of a customer-input question \(\hat{q}\) with respect to all identification questions, \(\mathbf{v}\in\mathbb{R}^n\), and \(\mathbf{v}=(v_1,\cdots,v_n)\)

\(\mathbf{s}_j\):

integrated similarity vector of the j-th customer-input question with respect to all identification questions, \(\mathbf{s}_j\in\mathbb{R}^n\)

\(Z\):

the concave nonlinear function, which expands the differences among BLEU scores in the lower range

\(T\):

the piecewise nonlinear function, which uses the threshold \(\theta\) and the negative enhancement vector \(\mathbf{r}\) to convert the result of \(Z\) into the structural similarity score vector \(\mathbf{u}\)

\(E\):

the BERT encoder, which maps a sentence to its embedding, an \(h\)-dimensional vector

α :

the semantic similarity embedding combination vector
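
A minimal sketch of how the structure-domain symbols above fit together, assuming a square root as the concave function \(Z\) and a simple threshold rule as the piecewise function \(T\); the concrete forms used in the paper are not reproduced here, so both choices and the numbers are purely illustrative.

```python
import numpy as np

def Z(bleu_scores):
    """Concave nonlinear function: expands differences among low BLEU scores
    (the square root is one possible choice, used only for illustration)."""
    return np.sqrt(bleu_scores)

def T(z, theta, r):
    """Piecewise nonlinear function: elements below the threshold theta are
    replaced by the negative enhancement value r, the rest are kept."""
    return np.where(z >= theta, z, r)

# Example: BLEU scores of one customer-input question against n = 4
# identification questions (toy numbers).
bleu = np.array([0.02, 0.35, 0.08, 0.60])
u = T(Z(bleu), theta=0.3, r=-1.0)   # structural similarity score vector u
print(u)                            # -> [-1.  0.5916 -1.  0.7746] (approximately)
```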


Acknowledgements

This paper has been supported by the Bigdata & Artificial Intelligence Laboratory of Tongji University and Shanghai Changtou Network Technology Co., Ltd. The authors also gratefully thank iFLYTEK AI Research and Brightmart for their open-source pre-trained model parameters trained on Chinese, and HanLP for their open-source Chinese tokenizer. Thanks to Dr. Yubo Chen of Tsinghua University for his academic inspiration and professional guidance.

Funding

This work was funded by the National Natural Science Foundation of China under Grants No. 61973238 and 61773292, and by the Innovation Program of the Science and Technology Commission of Shanghai Municipality under Grant No. 19DZ1209200.

Author information


Corresponding author

Correspondence to Zhiyu Xu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Case description

We evaluate the performance of our algorithms on a realistic case of online energy trading service and finance, which contains three datasets as shown in Table 6: the identification library set, the training set, and the test set. The customer-input questions in the training set are all inliers that can be matched to identification questions, i.e. there are no counterexamples in the training set. Most customer-input questions in the test set are inliers, and there are also a few outliers. Each customer-input question in the training set has one and only one corresponding identification question. All the questions are in Chinese. Due to the confidentiality agreement, unfortunately our training set cannot be published.

Table 6 Dataset description

Appendix B: Evaluation index

Two groups of accuracy are defined to evaluate the performance: the former discriminates inliers from outliers in a binary classification, and the latter measures the ratio of correct matches in the multi-class identification.

Group 1:

\(inlier\ accuracy=\frac {\mathrm {number\ of\ correctly\ identified\ inliers}}{\mathrm {total\ number\ of\ inliers\ in\ test\ set}}\times 100\%\)

\(outlier\ accuracy=\frac {\mathrm {number\ of\ correctly\ identified\ outliers}}{\mathrm {total\ number\ of\ outliers\ in\ test\ set}}\times 100\%\)

\(global\ accuracy=\frac{\mathrm{number\ of\ correctly\ identified\ inliers\ and\ outliers}}{\mathrm{total\ number\ of\ inliers\ and\ outliers\ in\ test\ set}}\times 100\%\)
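
These three Group-1 indices reduce to straightforward counting over the binary inlier/outlier decisions. A minimal sketch follows, assuming label 1 marks an inlier and 0 an outlier; the labels and names are illustrative, not taken from the paper.

```python
def group1_accuracies(y_true, y_pred):
    """Inlier, outlier, and global accuracy (in percent) from binary labels."""
    inliers  = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    outliers = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    inlier_acc  = sum(t == p for t, p in inliers)  / len(inliers)  * 100
    outlier_acc = sum(t == p for t, p in outliers) / len(outliers) * 100
    global_acc  = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true) * 100
    return inlier_acc, outlier_acc, global_acc

# Toy example: 4 inliers and 2 outliers in the test set.
print(group1_accuracies([1, 1, 1, 1, 0, 0], [1, 1, 0, 1, 0, 1]))
# -> (75.0, 50.0, 66.67) approximately
```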

Group 2:

For a customer-input question \(\hat{q}\), it is compared to each identification question \(q_{i}^{\ast}\in\mathbf{Q}^{\ast}\) and the corresponding similarity score \(s_i\) is computed. The Top-1 accuracy measures how often the single \(q_{i}^{\ast}\) with the highest similarity score is the correct match.

\(Top\text{-}1\ accuracy=\frac{\mathrm{number\ of\ correct\ Top\text{-}1\ matches}}{\mathrm{total\ number\ in\ test\ set}}\times 100\%\)

where

\(correct\ Top\mathrm {-1\ }match=\left \{\begin {array}{ll}1,&\text {if the }\mathit {Top}\text {-1 match is correct or the judgment of outlier is correct}\\0,&\text {otherwise} \end {array}\right .\)

The Top-3 accuracy indicates that the three \(q_{i}^{\ast}\) with the highest similarity scores include the correct match.

\(Top\text{-}3\ accuracy=\frac{\mathrm{number\ of\ correct\ Top\text{-}3\ matches}}{\mathrm{total\ number\ in\ test\ set}}\times 100\%\)

where

\(correct\ Top\mathrm {-3\ } match=\left \{\begin {array}{ll}1,&\text {if one of the }\mathit {Top}\text {-3 matches is correct or the judgment of outlier is correct}\\0,&\text {otherwise} \end {array}\right .\)

Obviously, the Top-1 match is contained in the Top-3 matches, so the Top-1 accuracy is no greater than the Top-3 accuracy.
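
A minimal sketch of the Top-k counting rule above, assuming a query rejected as an outlier is represented by a missing score vector and a true outlier by a missing class index; this representation is an illustrative assumption, not the paper's data format.

```python
import numpy as np

def topk_accuracy(score_vectors, true_indices, k):
    """Top-k accuracy in percent. score_vectors: per-query similarity vectors
    s_j over the n identification questions, or None for a rejected (outlier)
    query; true_indices: correct class index, or None for a true outlier."""
    correct = 0
    for s, true_idx in zip(score_vectors, true_indices):
        if s is None:                       # query was rejected as an outlier
            correct += (true_idx is None)   # correct only if it really is one
        elif true_idx is not None:
            topk = np.argsort(s)[::-1][:k]  # indices of the k highest scores
            correct += int(true_idx in topk)
    return correct / len(true_indices) * 100

# Toy example: 2 inliers and 1 true outlier (rejected correctly).
scores = [np.array([0.1, 0.8, 0.3]), np.array([0.7, 0.2, 0.6]), None]
truth  = [1, 2, None]
print(topk_accuracy(scores, truth, k=1))   # ~66.7: second query's Top-1 is index 0
print(topk_accuracy(scores, truth, k=3))   # 100.0: index 2 lies within the Top-3
```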

About this article


Cite this article

Li, B., Xu, W., Xu, Z. et al. A two-domain coordinated sentence similarity scheme for question-answering robots regarding unpredictable outliers and non-orthogonal categories. Appl Intell 51, 8928–8944 (2021). https://doi.org/10.1007/s10489-021-02269-7

