Research Article | Open Access
DOI: 10.1145/3462757.3466088

When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

Published: 27 July 2021

Abstract

While self-supervised learning has made rapid advances in natural language processing, it remains unclear when researchers should engage in resource-intensive domain-specific pretraining (domain pretraining). The law, puzzlingly, has yielded few documented instances of substantial gains to domain pretraining despite the fact that legal language is widely seen to be unique. We hypothesize that this stems from the fact that existing legal NLP tasks are too easy and fail to meet the conditions under which domain pretraining can help. To address this, we first present CaseHOLD (Case Holdings On Legal Decisions), a new dataset comprising over 53,000 multiple choice questions to identify the relevant holding of a cited case. This dataset presents a fundamental task to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). Second, we assess performance gains on CaseHOLD and existing legal NLP datasets. While a Transformer architecture (BERT) pretrained on a general corpus (Google Books and Wikipedia) improves performance, domain pretraining (on a corpus of ≈3.5M decisions across all courts in the U.S. that is larger than BERT's) with a custom legal vocabulary exhibits the most substantial performance gains on CaseHOLD (a gain of 7.2% in F1, representing a 12% improvement over BERT) and consistent performance gains across two other legal tasks. Third, we show that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the size of the performance gain across the three legal tasks was directly tied to the domain specificity of each task. Our findings inform when researchers should engage in resource-intensive pretraining and show that Transformer-based architectures, too, learn embeddings suggestive of distinct legal language.
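The CaseHOLD task described above is a multiple-choice problem: given the text of a citing passage, a model must select the correct holding of the cited case from several candidates. As a rough illustration of how such a task is typically fine-tuned with a BERT-style model, the sketch below pairs a citing context with each candidate holding and scores the pairs jointly. This is not the authors' released code; the checkpoint name, context, candidate holdings, and label are placeholders for illustration only.

```python
# Hedged sketch of a CaseHOLD-style multiple-choice setup with a generic BERT checkpoint.
# The citing context, candidate holdings, and label below are invented placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

# One example: a citing passage with the holding elided, plus five candidate holdings.
context = "The district court dismissed the claim as untimely, see Smith v. Jones (<HOLDING>)."
candidates = [
    "holding that the limitations period begins to run on discovery of the injury",
    "holding that venue was improper in the district of filing",
    "holding that expert testimony was required to establish causation",
    "holding that the arbitration clause was unconscionable",
    "holding that the evidence was insufficient to support the verdict",
]
label = torch.tensor([0])  # index of the correct candidate (placeholder)

# Encode the context paired with each candidate; the model expects inputs of shape
# (batch_size, num_choices, seq_len), so add a batch dimension.
enc = tokenizer([context] * len(candidates), candidates,
                padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

out = model(**inputs, labels=label)
print(float(out.loss), int(out.logits.argmax(dim=-1)))  # loss and predicted choice index
```

Per the abstract, the largest gains on this task came not from a generic checkpoint like the one used here but from domain pretraining on a corpus of ≈3.5M U.S. court decisions with a custom legal vocabulary.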




Published In

ICAIL '21: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law
June 2021, 319 pages
ISBN: 9781450385268
DOI: 10.1145/3462757
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. benchmark dataset
2. law
3. natural language processing
4. pretraining


Acceptance Rates

Overall Acceptance Rate: 69 of 169 submissions, 41%


Cited By

• (2025) Exploring LLMs Applications in Law: A Literature Review on Current Legal NLP Approaches. IEEE Access 13, 18253-18276. DOI: 10.1109/ACCESS.2025.3533217
• (2025) Reducing Judicial Inconsistency through AI: A Review of Legal Judgement Prediction Models. ITM Web of Conferences 70, 02009 (23 Jan 2025). DOI: 10.1051/itmconf/20257002009
• (2025) Adaptive data augmentation for salient sentence identification in Indian judicial decisions. Evolving Systems 16(2) (23 Feb 2025). DOI: 10.1007/s12530-025-09671-3
• (2025) An Efficient Graph-Based Summarization Approach for Judicial Case Type Prediction Using BERT. International Conference on Systems and Technologies for Smart Agriculture, 677-687 (29 Jan 2025). DOI: 10.1007/978-981-97-5157-0_56
• (2025) Hybrid Classification of European Legislation Using Sustainable Development Goals. AIxIA 2024 – Advances in Artificial Intelligence, 105-118 (1 Jan 2025). DOI: 10.1007/978-3-031-80607-0_9
• (2024) Building and Leveraging Domain-specific Pre-trained Models to Support Japanese News Summarization. Journal of Natural Language Processing 31(4), 1717-1745. DOI: 10.5715/jnlp.31.1717
• (2024) Semantic Shift Stability: Auditing Time-Series Performance Degradation of Pre-trained Models via Semantic Shift of Words in Training Corpus. Journal of Natural Language Processing 31(4), 1563-1597. DOI: 10.5715/jnlp.31.1563
• (2024) A Survey on Challenges and Advances in Natural Language Processing with a Focus on Legal Informatics and Low-Resource Languages. Electronics 13(3), 648 (4 Feb 2024). DOI: 10.3390/electronics13030648
• (2024) Monitoring Sustainable Development Goals in European Legislation using Hybrid AI. Proceedings of the 17th International Conference on Theory and Practice of Electronic Governance, 261-269 (1 Oct 2024). DOI: 10.1145/3680127.3680223
• (2024) Measuring and Mitigating Gender Bias in Legal Contextualized Language Models. ACM Transactions on Knowledge Discovery from Data 18(4), 1-26 (13 Feb 2024). DOI: 10.1145/3628602
