Abstract
In this paper, we introduce our approaches using Transformer-based models for different problems of the COLIEE 2021 automatic legal text processing competition. Automated processing of legal documents is a challenging task because of the characteristics of legal documents as well as the limited amount of available data. Through detailed experiments, we found that Transformer-based pretrained language models can perform well on automated legal text-processing problems with appropriate approaches. We describe in detail the processing steps for each task, including problem formulation, data processing and augmentation, pretraining, and finetuning. In addition, we introduce to the community two pretrained models that take advantage of parallel translations in the legal domain, NFSP and NMSP. Among these, NFSP achieves the state-of-the-art result in Task 5 of the competition. Although the paper focuses on technical reporting, the novelty of its approaches can also serve as a useful reference for automated legal document processing using Transformer-based models.
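The parallel-translation pretraining signal mentioned above can be illustrated with a generic sketch. This is not the authors' exact NFSP or NMSP objective (those are defined in the paper itself); it only shows the common pattern of turning an aligned bilingual corpus into binary sentence-pair examples, where aligned pairs are positives and randomly mismatched pairs are negatives. The sentence lists and function name here are hypothetical:

```python
import random

def make_pair_examples(en_sents, ja_sents, seed=0):
    """Build binary sentence-pair examples from a parallel corpus:
    label 1 for an aligned (translation) pair, label 0 for a
    randomly mismatched pair."""
    assert len(en_sents) == len(ja_sents)
    rng = random.Random(seed)
    n = len(en_sents)
    examples = []
    for i, en in enumerate(en_sents):
        # Positive: the sentence and its aligned translation.
        examples.append((en, ja_sents[i], 1))
        # Negative: pair the sentence with a random *other* translation.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # ensure j != i
        examples.append((en, ja_sents[j], 0))
    return examples

# Tiny illustrative corpus (hypothetical sentences).
en = ["Article 1: Rights are granted.", "Article 2: Duties apply."]
ja = ["第1条 権利が与えられる。", "第2条 義務が適用される。"]
pairs = make_pair_examples(en, ja)
```

Examples of this shape can then feed a standard sentence-pair classification head on a pretrained Transformer, the general setting the paper's cross-lingual pretraining builds on.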
Acknowledgements
This work was supported by JSPS Kakenhi Grant Numbers 20H04295, 20K20406, and 20K20625. The research was also supported in part by the Asian Office of Aerospace R&D (AOARD), Air Force Office of Scientific Research (Grant no. FA2386-19-1-4041).
Cite this article
Nguyen, HT., Nguyen, MP., Vuong, THY. et al. Transformer-Based Approaches for Legal Text Processing. Rev Socionetwork Strat 16, 135–155 (2022). https://doi.org/10.1007/s12626-022-00102-2