
TBNF: A Transformer-based Noise Filtering Method for Chinese Long-form Text Matching


Abstract

In deep text matching, the large amount of noise in Chinese long-form texts degrades matching performance. Most long-form text matching models consume the full text indiscriminately, carrying this noise into the matching process; this paper therefore combines the PageRank algorithm with the Transformer to filter it out at two granularities. For sentence-level noise detection, sentence similarity is first estimated from the word-overlap rate; a sentence-level relationship graph is then built and filtered with the PageRank algorithm. For word-level noise detection, a word graph is built from the attention scores inside the Transformer; PageRank is run on this graph and combined with the self-attention weights to select the keywords that highlight topic relevance, and noisy words are filtered out layer by layer across the Transformer layers. In addition, PolyLoss replaces the traditional binary cross-entropy loss during training, reducing the difficulty of hyperparameter tuning. Finally, a filtering strategy combining these components is proposed and verified experimentally on two Chinese long-form text matching datasets. The results show that a matching model equipped with the proposed noise-filtering strategy filters noise more effectively and captures matching signals more accurately.
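
The abstract leaves the sentence-level step at a high level: word-overlap similarity feeds a sentence relationship graph that PageRank then ranks. The sketch below is one minimal reading of that pipeline, assuming Jaccard overlap as the similarity measure; the function names and the `keep_ratio` parameter are illustrative, not taken from the paper.

```python
# Sentence-level noise filtering: rank sentences by PageRank over an
# overlap graph and keep the highest-scoring ones (a sketch, not the
# paper's exact procedure).
import jieba          # Chinese word segmentation
import networkx as nx

def overlap_rate(words_a, words_b):
    """Word-overlap similarity of two sentences (Jaccard, assumed)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_sentences(sentences, keep_ratio=0.5):
    """Build the sentence relationship graph, run PageRank, keep top sentences."""
    tokens = [jieba.lcut(s) for s in sentences]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = overlap_rate(tokens[i], tokens[j])
            if w > 0:
                graph.add_edge(i, j, weight=w)
    scores = nx.pagerank(graph, weight="weight")
    k = max(1, int(len(sentences) * keep_ratio))
    kept = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
    return [sentences[i] for i in kept]
```

Keeping the surviving sentences in their original order (the final `sorted`) preserves local coherence for the downstream matcher.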
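For the word-level step, the abstract says a word graph is derived from the Transformer's attention scores, and PageRank, combined with the self-attention weights, prunes noisy words layer by layer. Since a softmax attention matrix is row-stochastic, one natural reading is to treat the head-averaged attention matrix of a layer directly as the word graph's transition matrix; the damping factor and `drop_ratio` below are assumed defaults, not values from the paper.

```python
# Word-level noise filtering: PageRank-style power iteration over one
# layer's attention matrix, dropping the lowest-scoring tokens before
# the next layer (a hedged sketch of the idea, not the paper's code).
import torch

def pagerank_on_attention(attn, damping=0.85, iters=50):
    """attn: (seq_len, seq_len) attention averaged over heads.
    Each row sums to 1, so attn acts as a transition matrix."""
    n = attn.size(0)
    scores = torch.full((n,), 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * (attn.t() @ scores)
    return scores

def keep_mask(attn, drop_ratio=0.1):
    """Boolean mask filtering out the lowest-PageRank tokens of a layer."""
    scores = pagerank_on_attention(attn)
    n_drop = int(attn.size(0) * drop_ratio)
    mask = torch.ones(attn.size(0), dtype=torch.bool)
    if n_drop > 0:
        mask[torch.topk(scores, n_drop, largest=False).indices] = False
    return mask
```

Applying `keep_mask` after each Transformer layer gives the layer-by-layer filtering the abstract describes: later layers rank only the words that survived the earlier ones.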
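On the loss side, the paper replaces binary cross-entropy with PolyLoss. The simplest member of that family is the Poly-1 loss of Leng et al. (ICLR 2022), which adds a single polynomial correction term ε(1 − p_t) to cross-entropy; the sketch below assumes that form for the binary matching head, and `eps = 1.0` is a placeholder rather than the paper's tuned value.

```python
# Poly-1 PolyLoss for binary classification: cross-entropy plus one
# correction term, so only `eps` needs tuning (a sketch under the Poly-1
# assumption; the paper's exact configuration is not given in the abstract).
import torch
import torch.nn.functional as F

def poly1_binary_loss(logits, targets, eps=1.0):
    """logits, targets: shape (batch,); targets are 0/1 match labels."""
    t = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, t, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * t + (1 - p) * (1 - t)  # probability assigned to the true class
    return (ce + eps * (1.0 - p_t)).mean()
```

Because the extra term collapses the search space to the single scalar `eps`, this is where the claimed reduction in hyperparameter-tuning difficulty comes from.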


Notes

  1. https://github.com/BangLiu/ArticlePairMatching

  2. https://github.com/pl8787/Match-Ignition


Author information


Corresponding author

Correspondence to Liuhui Hu.

Ethics declarations

Conflicts of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gan, L., Hu, L., Tan, X. et al. TBNF: A Transformer-based Noise Filtering Method for Chinese Long-form Text Matching. Appl Intell 53, 22313–22327 (2023). https://doi.org/10.1007/s10489-023-04607-3

