HIRS: A Hybrid Information Retrieval System for Legislative Documents

  • Conference paper
Progress in Artificial Intelligence (EPIA 2024)

Abstract

The use of Transformers for text processing has attracted a great deal of attention in recent years. This is particularly true for sentence models, which have a high capacity to comprehend and generate text in context and improve predictive performance on various Natural Language Processing tasks compared with previous approaches. Even so, several challenges remain when these models are applied to long documents, especially in knowledge areas with very specific characteristics, such as legislative proposals. This study investigated different strategies for using BERT-based models to retrieve long documents written in Brazilian Portuguese. We used three corpora from the Brazilian Chamber of Deputies to build a dataset and assess the models, incorporating zero-shot and fine-tuning strategies. Five sentence models were evaluated: BERTimbau, LegalBert, LegalBert-pt, LegalBERTimbau, and LaBSE. We also assessed a summarized corpus of bills, considering the input size limitation of the sentence models. Finally, we propose a hybrid model, named HIRS, combining BM25 and BERTimbau with fine-tuning. According to the experimental results, HIRS outperformed the other models, achieving a Recall of 84.78% for 20 documents.
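The reported "Recall of 84.78% for 20 documents" reads as a recall-at-k measurement with k = 20. The abstract does not spell out the evaluation protocol, so the following is only a minimal sketch of the standard definition, assuming each query has a known set of relevant documents and that per-query recall is averaged across queries. The function names and the toy document IDs are hypothetical and do not come from the paper.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("recall is undefined for a query with no relevant documents")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]], k: int
) -> float:
    """Average recall@k over (ranked_list, relevant_set) pairs, one pair per query."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    if not scores:
        raise ValueError("no queries to evaluate")
    return sum(scores) / len(scores)


# Toy data (hypothetical IDs): two queries, each with a ranked result list
# and the set of documents judged relevant for that query.
toy_runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d9", "d4", "d8"], {"d4"}),
]
score_at_2 = mean_recall_at_k(toy_runs, k=2)
score_at_4 = mean_recall_at_k(toy_runs, k=4)
```

With k = 20, the same function would score any retriever that returns a ranked list, whether lexical (BM25), dense (a sentence model), or a combination of the two.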

Notes

  1. https://huggingface.co/meta-llama/Llama-2-13b-chat-hf.

  2. Translated version of: “Sumarize o texto a seguir. Retorne o sumário em um único parágrafo abrangendo os pontos principais que foram identificados no texto: \n <texto_original> \n RESUMO:” (in English, roughly: “Summarize the following text. Return the summary in a single paragraph covering the main points identified in the text: \n <original_text> \n SUMMARY:”).

  3. huggingface.co/.

  4. https://huggingface.co/neuralmind/bert-large-portuguese-cased.

  5. https://huggingface.co/ulysses-camara/legal-bert-pt-br.

  6. https://huggingface.co/raquelsilveira/legalbertpt_fp.

  7. https://huggingface.co/rufimelo/Legal-BERTimbau-large.

  8. https://huggingface.co/sentence-transformers/LaBSE.

Acknowledgement

This research was financed in part by CAPES (Brazil) and the National Institute of Artificial Intelligence (IAIA), using computational resources provided by CeMEAI (funded by FAPESP grant 2013/07375-0).

Author information

Corresponding author

Correspondence to José Antônio dos Santos.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

dos Santos, J.A. et al. (2025). HIRS: A Hybrid Information Retrieval System for Legislative Documents. In: Santos, M.F., Machado, J., Novais, P., Cortez, P., Moreira, P.M. (eds) Progress in Artificial Intelligence. EPIA 2024. Lecture Notes in Computer Science, vol 14967. Springer, Cham. https://doi.org/10.1007/978-3-031-73497-7_26

  • DOI: https://doi.org/10.1007/978-3-031-73497-7_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73496-0

  • Online ISBN: 978-3-031-73497-7

  • eBook Packages: Computer Science, Computer Science (R0)
