A Semantic Search System for the Supremo Tribunal de Justiça

  • Conference paper
Progress in Artificial Intelligence (EPIA 2023)

Abstract

Many information retrieval systems rely on lexical approaches to retrieve information. Such approaches have multiple limitations, which are exacerbated in specialized domains such as the legal one. Large language models, such as BERT, capture the context in which words are used and may overcome the limitations of older methods, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justiça (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that combines lexical and semantic techniques, pairing BM25 with Legal-BERTimbau. In this context, we obtained a \(335\%\) increase in the discovery metric over BM25 for the first query result. This work also describes the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique, Metadata Knowledge Distillation.
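The idea of a hybrid search that combines BM25 with embedding similarity can be illustrated with a minimal sketch. The abstract does not state how the two signals are combined, so the min-max normalization and the weighted sum with a hypothetical `alpha` parameter below are assumptions, not the authors' method. The two-dimensional vectors are hand-made stand-ins for Legal-BERTimbau sentence embeddings, and the function names (`bm25_scores`, `hybrid_rank`) are illustrative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(xs):
    """Rescale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5):
    """Return document indices ordered by alpha*lexical + (1-alpha)*semantic."""
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


# Toy corpus (hypothetical data, not taken from the paper).
docs = [
    "recurso de revista contrato de arrendamento".split(),
    "responsabilidade civil acidente de viação".split(),
    "despejo do inquilino por falta de pagamento da renda".split(),
]
# Hand-made vectors standing in for sentence embeddings.
doc_vecs = [[0.8, 0.6], [0.0, 1.0], [0.95, 0.31]]
query = "contrato de arrendamento".split()
query_vec = [1.0, 0.0]

print(hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5))
```

With `alpha=1.0` the ranking reduces to pure BM25, and with `alpha=0.0` to pure embedding similarity. In this toy data the third document shares no content words with the query but is close to it in the stand-in embedding space, which is the kind of match a purely lexical ranker cannot find.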

Notes

  1. https://www.elastic.co/

  2. https://huggingface.co/pierreguillou/t5-base-qa-squad-v1.1-portuguese

  3. https://www.ted.com/

  4. https://huggingface.co/sentence-transformers/stsb-roberta-large

Acknowledgements

The presented work was done as part of INESC-ID’s project “Sumarização e Informação de decisões: Aplicação de Técnicas de Inteligência Artificial no Supremo Tribunal de Justiça” (IRIS), in collaboration with STJ. This work was partially supported by STJ and by national funds through Fundação para a Ciência e a Tecnologia (FCT) through projects UIDB/50021/2020, UIDB/04326/2020, UIDP/04326/2020 and LA/P/0101/2020.

Author information

Corresponding author

Correspondence to Rui Melo.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Melo, R., Santos, P.A., Dias, J. (2023). A Semantic Search System for the Supremo Tribunal de Justiça. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds) Progress in Artificial Intelligence. EPIA 2023. Lecture Notes in Computer Science, vol 14116. Springer, Cham. https://doi.org/10.1007/978-3-031-49011-8_12

  • DOI: https://doi.org/10.1007/978-3-031-49011-8_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-49010-1

  • Online ISBN: 978-3-031-49011-8

  • eBook Packages: Computer Science, Computer Science (R0)
