Abstract
Many information retrieval systems rely on lexical approaches. Such approaches have several limitations, which are exacerbated in specialized domains such as the legal one. Large language models, such as BERT, capture the semantics of a language and may overcome the limitations of older methodologies such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justiça (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that combines lexical and semantic techniques, pairing BM25 with Legal-BERTimbau. In this context, we obtained a \(335\%\) increase in the discovery metric compared to BM25 for the first query result. This work also describes the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique, Metadata Knowledge Distillation.
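The abstract does not state how the Hybrid Search System merges its lexical and semantic signals. As a minimal sketch of the general idea, the Python below scores documents with Okapi BM25, scores them again by cosine similarity between embedding vectors, min-max normalizes both score lists, and ranks by a weighted sum. The weighted sum, the `alpha` parameter, the function names, and the toy three-dimensional vectors (standing in for Legal-BERTimbau sentence embeddings) are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5):
    """Return document indices ordered by a weighted lexical/semantic score.

    alpha=1.0 uses BM25 only; alpha=0.0 uses embedding similarity only.
    The weighted sum is an illustrative assumption, not the paper's method.
    """
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


# Toy corpus; the vectors are made-up placeholders for real sentence embeddings.
docs = [
    "recurso de revista contrato".split(),
    "responsabilidade civil acidente".split(),
    "contrato de trabalho despedimento".split(),
]
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = ["despedimento"]
query_vec = [0.1, 0.9, 0.2]

ranking = hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5)
```

In this toy setup the lexical and semantic signals disagree on purpose: only the third document contains the query term, while the placeholder query vector lies closest to the second document. Varying `alpha` shows how the weighting moves the ranking between the two.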
Acknowledgements
The presented work was done as part of INESC-ID’s project “Sumarização e Informação de decisões: Aplicação de Técnicas de Inteligência Artificial no Supremo Tribunal de Justiça” (IRIS), in collaboration with STJ. This work was partially supported by STJ and by national funds through Fundação para a Ciência e a Tecnologia (FCT) through projects UIDB/50021/2020, UIDB/04326/2020, UIDP/04326/2020 and LA/P/0101/2020.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Melo, R., Santos, P.A., Dias, J. (2023). A Semantic Search System for the Supremo Tribunal de Justiça. In: Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R. (eds) Progress in Artificial Intelligence. EPIA 2023. Lecture Notes in Computer Science, vol 14116. Springer, Cham. https://doi.org/10.1007/978-3-031-49011-8_12
DOI: https://doi.org/10.1007/978-3-031-49011-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49010-1
Online ISBN: 978-3-031-49011-8
eBook Packages: Computer Science, Computer Science (R0)