
TBNF: A Transformer-based Noise Filtering Method for Chinese Long-form Text Matching


Abstract

In deep text matching, the large amount of noise in Chinese long-form texts degrades matching performance. Most long-form text matching models consume the full text indiscriminately, carrying this noise into the matching process; this paper therefore combines the PageRank algorithm with the Transformer to filter it out at two granularities. For sentence-level noise detection, sentence similarity is first estimated from the word-overlap rate; a sentence-level relationship graph is then built and filtered with the PageRank algorithm. For word-level noise detection, a word graph is built from the attention scores inside the Transformer; PageRank is run on this graph and combined with the self-attention weights to select the keywords that highlight topic relevance, and noisy words are filtered out layer by layer across the Transformer layers. In addition, PolyLoss replaces the traditional binary cross-entropy loss during training, reducing the difficulty of hyperparameter tuning. Finally, a filtering strategy combining these components is proposed and verified experimentally on two Chinese long-form text matching datasets. The results show that a matching model equipped with the proposed noise-filtering strategy filters noise more effectively and captures matching signals more accurately.
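
The abstract leaves the sentence-level step at a high level: word-overlap similarity feeds a sentence relationship graph that PageRank then ranks. The sketch below is one minimal reading of that pipeline, assuming Jaccard overlap as the similarity measure; the function names and the `keep_ratio` parameter are illustrative, not taken from the paper.

```python
# Sentence-level noise filtering: rank sentences by PageRank over an
# overlap graph and keep the highest-scoring ones (a sketch, not the
# paper's exact procedure).
import jieba          # Chinese word segmentation
import networkx as nx

def overlap_rate(words_a, words_b):
    """Word-overlap similarity of two sentences (Jaccard, assumed)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_sentences(sentences, keep_ratio=0.5):
    """Build the sentence relationship graph, run PageRank, keep top sentences."""
    tokens = [jieba.lcut(s) for s in sentences]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = overlap_rate(tokens[i], tokens[j])
            if w > 0:
                graph.add_edge(i, j, weight=w)
    scores = nx.pagerank(graph, weight="weight")
    k = max(1, int(len(sentences) * keep_ratio))
    kept = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
    return [sentences[i] for i in kept]
```

Keeping the surviving sentences in their original order (the final `sorted`) preserves local coherence for the downstream matcher.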
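For the word-level step, the abstract says a word graph is derived from the Transformer's attention scores, and PageRank, combined with the self-attention weights, prunes noisy words layer by layer. Since a softmax attention matrix is row-stochastic, one natural reading is to treat the head-averaged attention matrix of a layer directly as the word graph's transition matrix; the damping factor and `drop_ratio` below are assumed defaults, not values from the paper.

```python
# Word-level noise filtering: PageRank-style power iteration over one
# layer's attention matrix, dropping the lowest-scoring tokens before
# the next layer (a hedged sketch of the idea, not the paper's code).
import torch

def pagerank_on_attention(attn, damping=0.85, iters=50):
    """attn: (seq_len, seq_len) attention averaged over heads.
    Each row sums to 1, so attn acts as a transition matrix."""
    n = attn.size(0)
    scores = torch.full((n,), 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * (attn.t() @ scores)
    return scores

def keep_mask(attn, drop_ratio=0.1):
    """Boolean mask filtering out the lowest-PageRank tokens of a layer."""
    scores = pagerank_on_attention(attn)
    n_drop = int(attn.size(0) * drop_ratio)
    mask = torch.ones(attn.size(0), dtype=torch.bool)
    if n_drop > 0:
        mask[torch.topk(scores, n_drop, largest=False).indices] = False
    return mask
```

Applying `keep_mask` after each Transformer layer gives the layer-by-layer filtering the abstract describes: later layers rank only the words that survived the earlier ones.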
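On the loss side, the paper replaces binary cross-entropy with PolyLoss. The simplest member of that family is the Poly-1 loss of Leng et al. (ICLR 2022), which adds a single polynomial correction term ε(1 − p_t) to cross-entropy; the sketch below assumes that form for the binary matching head, and `eps = 1.0` is a placeholder rather than the paper's tuned value.

```python
# Poly-1 PolyLoss for binary classification: cross-entropy plus one
# correction term, so only `eps` needs tuning (a sketch under the Poly-1
# assumption; the paper's exact configuration is not given in the abstract).
import torch
import torch.nn.functional as F

def poly1_binary_loss(logits, targets, eps=1.0):
    """logits, targets: shape (batch,); targets are 0/1 match labels."""
    t = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, t, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * t + (1 - p) * (1 - t)  # probability assigned to the true class
    return (ce + eps * (1.0 - p_t)).mean()
```

Because the extra term collapses the search space to the single scalar `eps`, this is where the claimed reduction in hyperparameter-tuning difficulty comes from.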


Notes

  1. https://github.com/BangLiu/ArticlePairMatching

  2. https://github.com/pl8787/Match-Ignition


Author information


Corresponding author

Correspondence to Liuhui Hu.

Ethics declarations

Conflicts of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gan, L., Hu, L., Tan, X. et al. TBNF: A Transformer-based Noise Filtering Method for Chinese Long-form Text Matching. Appl Intell 53, 22313–22327 (2023). https://doi.org/10.1007/s10489-023-04607-3

