ABSTRACT
NLP in the legal domain has seen increasing success with the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. PLMs trained on European and US legal text are publicly available; however, legal text from other jurisdictions (countries), such as India, has many distinguishing characteristics. With the rapidly increasing volume of Legal NLP applications in various countries, it has become necessary to pre-train such LMs on the legal text of other countries as well. In this work, we investigate pre-training in the Indian legal domain. We re-train (continue pre-training) two popular legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data, and also train a model from scratch with a vocabulary based on Indian legal text. We apply these PLMs to three benchmark legal NLP tasks - Legal Statute Identification from facts, Semantic Segmentation of Court Judgment Documents, and Court Appeal Judgment Prediction - on both Indian and non-Indian (EU, UK) datasets. We observe that our approach not only enhances performance on the new domain (Indian texts) but also on the original domain (European and UK texts). We also conduct explainability experiments for a qualitative comparison of these different PLMs.
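The continued pre-training described above rests on the masked-language-modeling (Cloze-style) objective used by BERT-family models. As an illustration only (the function name `mlm_mask` and the toy `VOCAB` are assumptions, not artifacts of this paper), the standard 80/10/10 masking scheme can be sketched in a few lines:

```python
import random

MASK = "[MASK]"
# Toy vocabulary for the "random replacement" branch; a real run would
# sample from the model's full (e.g. Indian-legal) WordPiece vocabulary.
VOCAB = ["law", "court", "appeal", "section", "act"]

def mlm_mask(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of tokens; of those,
    80% become [MASK], 10% a random vocabulary token, 10% stay
    unchanged. Returns (masked_tokens, labels), where labels hold
    the original token at selected positions and None elsewhere."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)     # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(VOCAB))  # 10%: random token
            else:
                masked.append(tok)      # 10%: keep original
        else:
            labels.append(None)         # not selected: no loss here
            masked.append(tok)
    return masked, labels
```

In practice this masking is applied on the fly to tokenized Indian legal text, and the PLM is trained to recover the original tokens at the selected positions; libraries such as Hugging Face Transformers provide an equivalent built-in collator for this purpose.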
Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law