DOI: 10.1145/3404835.3463083
Short paper

KeyBLD: Selecting Key Blocks with Local Pre-ranking for Long Document Information Retrieval

Published: 11 July 2021

Abstract

Transformer-based models, especially pre-trained language models such as BERT, have shown great success on a variety of Natural Language Processing and Information Retrieval tasks. However, such models have difficulty processing long documents due to the quadratic complexity of the self-attention mechanism. Recent works either truncate long documents or segment them into passages that can be treated by a standard BERT model. A hierarchical architecture, such as a transformer, can further be adopted to build a document-level representation on top of the representations of each passage. However, these approaches either lose information or have a high computational complexity (and, in the latter case, are both time- and energy-consuming). We follow here a slightly different approach, in which one first selects key blocks of a long document by local query-block pre-ranking, and then aggregates a few blocks to form a short document that can be processed by a model such as BERT. Experiments conducted on standard Information Retrieval datasets demonstrate the effectiveness of the proposed approach.
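The two-stage idea in the abstract (locally pre-rank blocks against the query, then aggregate a few top blocks into a short document) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fixed block size, the toy term-frequency scorer (a stand-in for a lexical scorer such as BM25), and all function names are assumptions for this sketch.

```python
from collections import Counter

def split_into_blocks(text, block_size=64):
    """Split a document into fixed-size word blocks (block_size is an assumed value)."""
    words = text.split()
    return [" ".join(words[i:i + block_size]) for i in range(0, len(words), block_size)]

def block_score(query, block):
    """Toy local query-block score: total frequency of query terms in the block.
    A stand-in for a proper lexical pre-ranking function such as BM25."""
    query_terms = set(query.lower().split())
    counts = Counter(block.lower().split())
    return sum(counts[t] for t in query_terms)

def select_key_blocks(query, document, k=3, block_size=64):
    """Keep the k highest-scoring blocks and rejoin them in their original order,
    yielding a short document that fits a standard BERT input window."""
    blocks = split_into_blocks(document, block_size)
    ranked = sorted(range(len(blocks)),
                    key=lambda i: block_score(query, blocks[i]),
                    reverse=True)
    kept = sorted(ranked[:k])  # preserve document order of the selected blocks
    return " ".join(blocks[i] for i in kept)
```

The aggregated short document would then be fed to a BERT-style re-ranker; keeping the selected blocks in document order preserves what local coherence remains after selection.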



    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835

    Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. BERT-based models
    2. document representation for IR
    3. neural IR

    Qualifiers

    • Short-paper

    Funding Sources

    • MIAI@Grenoble Alpes
    • Chinese Scholarship Council (CSC)

    Conference

    SIGIR '21

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Cited By

    • (2024) Efficient Neural Ranking Using Forward Indexes and Lightweight Encoders. ACM Transactions on Information Systems, 42(5), 1-34. DOI: 10.1145/3631939. Online publication date: 29 Apr 2024.
    • (2024) Clinical Trial Retrieval via Multi-grained Similarity Learning. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2950-2954. DOI: 10.1145/3626772.3661366. Online publication date: 10 Jul 2024.
    • (2024) DUNKS: Chunking and Summarizing Large and Heterogeneous Data for Dataset Search. In The Semantic Web – ISWC 2024, 78-97. DOI: 10.1007/978-3-031-77850-6_5. Online publication date: 11 Nov 2024.
    • (2023) Extractive Explanations for Interpretable Text Ranking. ACM Transactions on Information Systems, 41(4), 1-31. DOI: 10.1145/3576924. Online publication date: 23 Mar 2023.
    • (2023) The Power of Selecting Key Blocks with Local Pre-ranking for Long Document Information Retrieval. ACM Transactions on Information Systems, 41(3), 1-35. DOI: 10.1145/3568394. Online publication date: 7 Feb 2023.
    • (2023) Semantic matching based legal information retrieval system for COVID-19 pandemic. Artificial Intelligence and Law, 32(2), 397-426. DOI: 10.1007/s10506-023-09354-x. Online publication date: 14 Mar 2023.
    • (2023) BERT-LBIA: A BERT-Based Late Bidirectional Interaction Attention Model for Legal Case Retrieval. In Neural Information Processing, 266-282. DOI: 10.1007/978-981-99-8184-7_21. Online publication date: 26 Nov 2023.
    • (2023) A Passage Retrieval Transformer-Based Re-Ranking Model for Truthful Consumer Health Search. In Machine Learning and Knowledge Discovery in Databases: Research Track, 355-371. DOI: 10.1007/978-3-031-43412-9_21. Online publication date: 18 Sep 2023.
    • (2022) Efficient Neural Ranking using Forward Indexes. In Proceedings of the ACM Web Conference 2022, 266-276. DOI: 10.1145/3485447.3511955. Online publication date: 25 Apr 2022.
    • (2022) BERT-based Dense Intra-ranking and Contextualized Late Interaction via Multi-task Learning for Long Document Retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2347-2352. DOI: 10.1145/3477495.3531856. Online publication date: 6 Jul 2022.
