
Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law

Published: 7 September 2023

ABSTRACT

NLP in the legal domain has seen increasing success with the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. PLMs trained on European and US legal text are publicly available; however, legal text from other jurisdictions, such as India, has many distinguishing characteristics. With the rapidly increasing volume of Legal NLP applications in various countries, it has become necessary to pre-train such LMs on the legal text of other countries as well. In this work, we investigate pre-training in the Indian legal domain. We re-train (continue pre-training) two popular legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data, and also train a model from scratch with a vocabulary based on Indian legal text. We apply these PLMs to three benchmark legal NLP tasks - Legal Statute Identification from facts, Semantic Segmentation of Court Judgment Documents, and Court Appeal Judgment Prediction - over both Indian and non-Indian (EU, UK) datasets. We observe that our approach enhances performance not only on the new domain (Indian texts) but also on the original domain (European and UK texts). We also conduct explainability experiments for a qualitative comparison of all these different PLMs.


• Published in
          ICAIL '23: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law
          June 2023
          499 pages
          ISBN:9798400701979
          DOI:10.1145/3594536

          Copyright © 2023 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

Overall acceptance rate: 69 of 169 submissions, 41%
