ABSTRACT
NLP in the legal domain has seen increasing success with the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. PLMs trained on European and US legal text are publicly available; however, legal text from other jurisdictions (countries), such as India, has many distinguishing characteristics. With the rapidly increasing volume of Legal NLP applications in various countries, it has become necessary to pre-train such LMs on the legal text of other countries as well. In this work, we investigate pre-training in the Indian legal domain. We re-train (continue pre-training) two popular legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data, and also train a model from scratch with a vocabulary based on Indian legal text. We apply these PLMs to three benchmark legal NLP tasks - Legal Statute Identification from facts, Semantic Segmentation of Court Judgment Documents, and Court Appeal Judgment Prediction - on both Indian and non-Indian (EU, UK) datasets. We observe that our approach not only enhances performance on the new domain (Indian texts) but also on the original domain (European and UK texts). We also conduct explainability experiments for a qualitative comparison of these different PLMs.
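The continued pre-training described above rests on the masked-language-modeling (Cloze-style) objective used by BERT-family models. As an illustration only (the function name `mlm_mask` and the toy `VOCAB` are assumptions, not artifacts of this paper), the standard 80/10/10 masking scheme can be sketched in a few lines:

```python
import random

MASK = "[MASK]"
# Toy vocabulary for the "random replacement" branch; a real run would
# sample from the model's full (e.g. Indian-legal) WordPiece vocabulary.
VOCAB = ["law", "court", "appeal", "section", "act"]

def mlm_mask(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of tokens; of those,
    80% become [MASK], 10% a random vocabulary token, 10% stay
    unchanged. Returns (masked_tokens, labels), where labels hold
    the original token at selected positions and None elsewhere."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)     # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(VOCAB))  # 10%: random token
            else:
                masked.append(tok)      # 10%: keep original
        else:
            labels.append(None)         # not selected: no loss here
            masked.append(tok)
    return masked, labels
```

In practice this masking is applied on the fly to tokenized Indian legal text, and the PLM is trained to recover the original tokens at the selected positions; libraries such as Hugging Face Transformers provide an equivalent built-in collator for this purpose.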
Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law